Zillow’s housing market predictions, known as Zestimates, are valued for their nationwide coverage and general accuracy. For example, the nationwide median error for off-market homes is 7.5% and for on-market homes is 1.9%1. However, when considering a specific city or region, the accuracy of the Zestimates could be improved by including locally-specific data in the prediction process. This analysis aims to build an accurate and generalizable hedonic model that predicts home prices for Miami by deconstructing overall home price into the value of constituent parts. Accurate models lead to only small differences between the predicted and observed values, and generalizable models accurately predict on new data and with comparable accuracy across various groups. By working to improve the accuracy and generalizability of the predictive model, we are ultimately striving to create a more useful decision-making tool. Such a model may be useful for local governments as they assess property taxes, for example.
To predict housing prices in the Miami area, we train a model using home prices from recent sales. Our model includes “features,” which are variables that are used to predict outcomes (in this case, home price). We use a hedonic model, which breaks home price into components that explain the cost: physical characteristics (e.g., living area), public services and amenities (e.g., distance to parks), and the spatial process of prices (e.g., prices for homes within neighborhoods cluster together). Our process uses data wrangling, exploratory analysis, feature engineering, feature selection, and model estimate and validation to produce a hedonic model that effectively predicts home prices for Miami and Miami Beach. Key challenges inherent to this process involve identifying and cleaning publicly available data for inclusion, investigating underlying spatial processes and trends, and selecting an effective set of features for inclusion in the model while avoiding choosing variables that are too closely correlated with each other.
Our final model for predicting home prices includes the 20 independent variables shown in the Table below. Overall, the indicators used to evaluate the fit of the model (e.g., MAE) show that our model is not predicting well and is therefore not effective for use. The model was able to predict 85% of the variation in price, and most important features were internal property features, including lot size, actual square footage, whether the home is zoned for single family, and whether the home had an elevator or a dock. Generally, we think that our model’s inability to predict expensive homes is the largest source of the error. This poor performance skews our overall MAE. According to the mapped results, we were unable to effectively account for the spatial variation in prices. The model predicted poorly in Miami Beach and along the mainland coast, which is where many of the expensive houses are located. We would recommend that Zillow make further improvements before employing the model for official use. The model has large errors and does not do a good job of predicting home prices. In order to improve this model, we suggest further exploring features that would better account for the spatial processes at play.
finalvars<- c('Lot Size', 'Number of Bedrooms', 'Number of Bathrooms', 'Home Age', 'Home Size', 'Presence of a Pool', 'Whether the Property is a Single Family Home', 'Presence of Luxury Features', 'Presence of an Elevator', 'Presence of a Dock', 'Estate Zoning', 'Number of Sexual Offenders or Predators within an Eighth Mile of the Home', 'Distance to the Nearest Park', 'Within 0.5 miles of a Metro Stop', 'Median Household Income', 'Percent of Population that is Hispanic', 'Percent of Population with a Bachelors Degree', 'Percent of Households that are Owned', 'Percent of Population that Commutes via Car', 'Percent of Households with a Household Income of $200,000 or more')
vartype<-c('Internal Characteristic', 'Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic', 'Amenities/Public Services', 'Amenities/Public Services', 'Amenities/Public Services', 'Spatial Structure', 'Spatial Structure', 'Spatial Structure', 'Spatial Structure', 'Spatial Structure', 'Spatial Structure')
datasource<-c('Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Miami Dade County Open Data', 'OpenStreetMap', 'Miami Dade County Open Data', 'U.S. Census Bureau American Community Survey', 'U.S. Census Bureau American Community Survey', 'U.S. Census Bureau American Community Survey', 'U.S. Census Bureau American Community Survey', 'U.S. Census Bureau American Community Survey', 'U.S. Census Bureau American Community Survey')
varsdesc<- c(
'Total Lot Size in Square Feet',
'Total Number of Bedrooms in the Home',
'Total Number of Bathrooms in the Home',
'Age of the Home, as of 2020',
'Total Size of the Home in Square Feet',
'Whether or Not the Property Includes a Pool',
'Whether the Property is a Single-Family Home or a Multi-Family Home',
'Whether the Home Contains any Luxury Features',
'Whether the Home Contains an Elevator',
'Whether the Home Contains a Dock',
'Whether the Home is Zoned as an Estate',
'The Number of Sexual Offenders and Predators within an Eighth of a Mile of the Home',
'Distance (in feet) to the nearest Park',
'Whether the Home is Within a Half-Mile of a MetroMover or MetroRail Stop',
'Median Household Income in the Census Tract',
'Percent of Population in the Census Tract that is Hispanic',
'Percent of Population in the Census Tract with a Bachelors Degree',
'Percent of Households in the Census Tract that are Owned',
'Percent of Population in the Census Tract that Commutes via Car',
'Percent of Households in the Census Tract with a Household Income of $200,000 or more')
varstable<- data.frame(finalvars, datasource, varsdesc)
varstable%>%
rename(Final_Variables = finalvars,
#Variable_Type = vartype,
Data_Source = datasource,
Variable_Description = varsdesc)%>%
kable(caption="Final Independent Variables Included in Home Price Prediction Model")%>%
pack_rows("Internal Characteristic", 1, 11, label_row_css = "background-color: #25CB10") %>%
pack_rows("Amenities/Public Services", 12,14, label_row_css = "background-color: #faf500")%>%
pack_rows("Spatial Structure", 15, 20, label_row_css = "background-color: #FA7800")%>%
kable_styling("striped", full_width = F)%>%
scroll_box(height = "250px")| Final_Variables | Data_Source | Variable_Description |
|---|---|---|
| Internal Characteristic | ||
| Lot Size | Home Data | Total Lot Size in Square Feet |
| Number of Bedrooms | Home Data | Total Number of Bedrooms in the Home |
| Number of Bathrooms | Home Data | Total Number of Bathrooms in the Home |
| Home Age | Home Data | Age of the Home, as of 2020 |
| Home Size | Home Data | Total Size of the Home in Square Feet |
| Presence of a Pool | Home Data | Whether or Not the Property Includes a Pool |
| Whether the Property is a Single Family Home | Home Data | Whether the Property is a Single-Family Home or a Multi-Family Home |
| Presence of Luxury Features | Home Data | Whether the Home Contains any Luxury Features |
| Presence of an Elevator | Home Data | Whether the Home Contains an Elevator |
| Presence of a Dock | Home Data | Whether the Home Contains a Dock |
| Estate Zoning | Home Data | Whether the Home is Zoned as an Estate |
| Amenities/Public Services | ||
| Number of Sexual Offenders or Predators within an Eighth Mile of the Home | Miami Dade County Open Data | The Number of Sexual Offenders and Predators within an Eighth of a Mile of the Home |
| Distance to the Nearest Park | OpenStreetMap | Distance (in feet) to the nearest Park |
| Within 0.5 miles of a Metro Stop | Miami Dade County Open Data | Whether the Home is Within a Half-Mile of a MetroMover or MetroRail Stop |
| Spatial Structure | ||
| Median Household Income | U.S. Census Bureau American Community Survey | Median Household Income in the Census Tract |
| Percent of Population that is Hispanic | U.S. Census Bureau American Community Survey | Percent of Population in the Census Tract that is Hispanic |
| Percent of Population with a Bachelors Degree | U.S. Census Bureau American Community Survey | Percent of Population in the Census Tract with a Bachelors Degree |
| Percent of Households that are Owned | U.S. Census Bureau American Community Survey | Percent of Households in the Census Tract that are Owned |
| Percent of Population that Commutes via Car | U.S. Census Bureau American Community Survey | Percent of Population in the Census Tract that Commutes via Car |
| Percent of Households with a Household Income of $200,000 or more | U.S. Census Bureau American Community Survey | Percent of Households in the Census Tract with a Household Income of $200,000 or more |
To begin the analysis, this section loads libraries and options, specifies styling options for maps and plots, creates quantile break functions and quantile styling, and loads color palettes. This section also creates the “nearest neighbor” function, which calculates the average nearest neighbor distance from each home to its k nearest objects of interest. This calculation is useful for creating features that describe the relative amount of an amenity around each home. For example, the nearest neighbor function can tell us the average distance from each home to the closest 1, 2, 3, 4, or 5 parks, bars, and crime events. In our exploratory analysis, we determine which nearest neighbor features are most appropriate for inclusion in the model.
Loading Libraries and Options
knitr::opts_chunk$set(
echo = TRUE)
library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools)
library(mapview)
library(ggstance)
library(broom.mixed)
library(osmdata)
library(geosphere)
library(tidycensus)
library(stargazer)
library(kableExtra)
library(kableExtra)
library(stargazer)
options(tigris_class = "sf")Loading Themes for Map and Plots
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 13,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "lightskyblue1", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
palette5 <- c("#25CB10", "#5AB60C", "#8FA108", "#C48C04", "#FA7800")
qBr <- function(df, variable, rnd) {
if (missing(rnd)) {
as.character(quantile(round(df[[variable]],0),
c(.01,.2,.4,.6,.8), na.rm=T))
} else if (rnd == FALSE | rnd == F) {
as.character(formatC(quantile(df[[variable]]), digits = 3),
c(.01,.2,.4,.6,.8), na.rm=T)
}
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}Specifying Nearest Neighbor Function
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <- as.matrix(measureFrom)
measureTo_Matrix <- as.matrix(measureTo)
nn <-
get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}This analysis uses data from a variety of sources: we derived physical characteristics of each house from the underlying home data, gathered public services and amenities information from the Miami-Dade County Open Data Hub and OpenStreetMap, and incorporated population data from the U.S. Census Bureau’s 5-year American Community Survey.
We accessed data for the analysis via the Miami-Dade County Open Data Hub API and using the osmdata package to access OpenStreetMap variables of interest. Most of the data required manipulation and cleaning before being explored. This process included transformations in order to maximize the effectiveness of prediction for each variable. The code below presents the steps in the data gathering process.
text_tabl <-data.frame(
Datasets = c("Miami Home Dataset", "Coastline Data", "Bars, Pubs, & Restaurant Data", "Sexual Offenders Data", "Miami Park Facilities Data","Miami-Dade Transit (MDT) Metromover and Metrorail stations", "2014-2018 ACS 5-year Estimates"),
Source = c("Provided", "OpenStreetMap", "OpenStreetMap", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "U.S. Census Bureau")
)
kbl(text_tabl) %>%
kable_paper(c("hover", "condensed"), full_width = F, html_font = "Montserrat") %>%
kable_styling (bootstrap_options = "striped", "condensed", font_size = 10) %>%
row_spec(0, bold = T, color = "white", background = "dodgerblue", font_size = 14)| Datasets | Source |
|---|---|
| Miami Home Dataset | Provided |
| Coastline Data | OpenStreetMap |
| Bars, Pubs, & Restaurant Data | OpenStreetMap |
| Sexual Offenders Data | Miami-Dade County Open Data Hub |
| Miami Park Facilities Data | Miami-Dade County Open Data Hub |
| Miami-Dade Transit (MDT) Metromover and Metrorail stations | Miami-Dade County Open Data Hub |
| 2014-2018 ACS 5-year Estimates | U.S. Census Bureau |
Loading Basemap
# Loading Miami base map from OSM Data
miami.base <-
st_read("https://opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0.geojson") %>%
filter(NAME == "MIAMI BEACH" | NAME == "MIAMI") %>%
st_union()
xmin = st_bbox(miami.base)[[1]]
ymin = st_bbox(miami.base)[[2]]
xmax = st_bbox(miami.base)[[3]]
ymax = st_bbox(miami.base)[[4]]Loading Home Data (Provided) and Neighborhood Data
MiamiProperties<-
st_read("C:/Users/wagne/Documents/GitHub/ParkWagner_MidtermMUSA508/studentsData.geojson")%>%
#st_read("/Users/davidseungleepark/Library/Mobile Documents/com~apple~CloudDocs/Fall 2020/cpln592/ParkWagner_MidtermMUSA508/studentsData.geojson") %>%
mutate(pool = ifelse(str_detect((XF1), "Pool") | str_detect((XF2), "Pool") | str_detect((XF3), "Pool"), "Pool", "No Pool")) %>%
mutate(singlefamily = ifelse(str_detect(Zoning, "SINGLE FAMILY"), "Yes", "No")) %>%
mutate(estates = ifelse(str_detect(Zoning, "ESTATES"), "Yes", "No")) %>%
mutate(dock = ifelse(str_detect((XF1), "Dock") | str_detect((XF2), "Dock") | str_detect((XF3), "Dock"), "Dock", "No Dock")) %>%
mutate(Age = (2020 - EffectiveYearBuilt),0) %>%
mutate(luxury=ifelse(str_detect((XF1), "Luxury") | str_detect((XF2), "Luxury") | str_detect((XF3), "Luxury"), "Yes", "No")) %>%
mutate(elevator=ifelse(str_detect((XF1), "Elevator") | str_detect((XF2), "Elevator") | str_detect((XF3), "Elevator"), "Yes", "No")) %>%
mutate(Gazebo=ifelse(str_detect((XF1), "Gazebo") | str_detect((XF2), "Gazebo") | str_detect((XF3), "Gazebo"), "Yes", "No")) %>%
mutate(BeachView=ifelse(str_detect((Legal1), "BEACH VIEW") | str_detect((Legal2), "BEACH VIEW") | str_detect((Legal3), "BEACH VIEW"), "Yes", "No")) %>%
dplyr::select(-Land, -Year,-Bldg, -Total, -Assessed, -saleDate, -saleType, -saleQual, -County.Senior, -County.LongTermSenior, -County.Other.Exempt, -Owner1, -Owner2,
-Mailing.Address, -Mailing.City, -Mailing.State, -Mailing.Zip, -Mailing.Country, -YearBuilt,
-City.Senior, -City.LongTermSenior, -City.Other.Exempt, -Legal1, -Legal2, -Legal3, -Legal4, -Legal5, -Legal6, -XF1, -XF2, -XF3,
-WVDB, -HEX, -GPAR, -County.2nd.HEX, -City.2nd.HEX, -MillCode, -Zoning, -Land.Use)
st_crs(MiamiProperties)
# Loading elementary school boundaries
elementary.school.boundaries <-
st_read("https://opendata.arcgis.com/datasets/19f5d8dcd9714e6fbd9043ac7a50c6f6_0.geojson") %>%
st_transform('ESRI:102658')
#Neighborhood Data
neighborhood<-
st_read("https://opendata.arcgis.com/datasets/2f54a0cbd67046f2bd100fb735176e6c_0.geojson")%>%
st_transform('ESRI:102658')%>%
dplyr::select(LABEL)%>%
rename(Neighborhood=LABEL)
muni_boundary<-
st_read("https://opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0.geojson")%>%
filter(NAME=="MIAMI BEACH")%>%
st_transform('ESRI:102658')%>%
dplyr::select(NAME)%>%
rename(Neighborhood=NAME)
all_nhoods<-
rbind(neighborhood,muni_boundary)
#Miami Beach Neighborhoods
MBAreas<-st_read("https://opendata.arcgis.com/datasets/a21e846f4e8e4d81ad3b75bc4f334516_0.geojson")%>%
filter(FID>1)%>%
filter(FID<130)%>%
filter(LAND=="Y")%>%
st_transform('ESRI:102658')%>%
dplyr::select(FID)%>%
rename(Neighborhood=FID)
miami.base_map<-
miami.base%>%
st_transform('ESRI:102658')
elem_map<-
elementary.school.boundaries%>%
st_crop(miami.base_map)%>%
rename(elem_name=NAME)
all_nhoods_MB<-
rbind(neighborhood,MBAreas)%>%
st_crop(miami.base_map)Calculating Distance to Coastline; Calculating proximity to Bars, Pubs, and Restaurants; Sexual Offenses; Parks; and Public Transportation
## Calculate the distance to Coastline (this calculation has to be in WGS84)
Coastline<-opq(bbox = c(xmin, ymin, xmax, ymax)) %>%
add_osm_feature("natural", "coastline") %>%
osmdata_sf()
### add to MiamiProperties and convert to miles
MiamiProperties <-
MiamiProperties %>%
mutate(CoastDist=(geosphere::dist2Line(p=st_coordinates(st_centroid(MiamiProperties)),
line=st_coordinates(Coastline$osm_lines)[,1:2])*0.00062137)[,1])
##Convert Miami Data to Local Projection #st_transform('ESRI:102658')
MiamiProperties <-
MiamiProperties%>%
st_transform('ESRI:102658')
MiamiProperties<-
MiamiProperties%>%
mutate(milecoast=ifelse(CoastDist<1,"Yes","No"))%>%
mutate(halfmilecoast=ifelse(CoastDist<0.5,"Yes","No"))
## Add Data on Bars, pubs, and restaurants
bars <- opq(bbox = c(xmin, ymin, xmax, ymax)) %>%
add_osm_feature(key = 'amenity', value = c("bar", "pub", "restaurant")) %>%
osmdata_sf()
bars<-
bars$osm_points%>%
.[miami.base,]
bars<-
bars%>%
st_transform('ESRI:102658')
MiamiProperties<-
MiamiProperties %>%
mutate(
bars_nn1= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(bars),1),
bars_nn2= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(bars),2),
bars_nn3= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(bars),3),
bars_nn4= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(bars),4),
bars_nn5= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(bars),5))
MiamiProperties$bars_Buffer =
st_buffer(MiamiProperties, 660) %>%
aggregate(mutate(dplyr::select(bars), counter = 1),., sum) %>%
pull(counter)
MiamiProperties<-
MiamiProperties%>%
mutate(bars_Buffer = replace_na(bars_Buffer, 0))
## Add data on crime- Sexual Offenders and Predators within Miami-Dade County point data
miami.sexualoffenders <-
st_read("https://opendata.arcgis.com/datasets/f8759d722aeb4198bfe7c4ad780604d2_0.geojson") %>%
filter(CITY == "MIAMI" | CITY == "MIAMI BEACH" | CITY == "Miami" | CITY == "Miami Beach") %>%
st_transform('ESRI:102658')
MiamiProperties$sexualoffenders_Buffer =
st_buffer(MiamiProperties, 660) %>%
aggregate(mutate(dplyr::select(miami.sexualoffenders), counter = 1),., sum) %>%
pull(counter)
MiamiProperties<-
MiamiProperties%>%
mutate(sexualoffenders_Buffer = replace_na(sexualoffenders_Buffer, 0))
MiamiProperties <-
MiamiProperties %>%
mutate(
crime_nn1= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(miami.sexualoffenders),1),
crime_nn2= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(miami.sexualoffenders),2),
crime_nn3= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(miami.sexualoffenders),3),
crime_nn4= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(miami.sexualoffenders),4),
crime_nn5= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(miami.sexualoffenders),5))
## Add the data on Park Facilities
Parks<-st_read("https://opendata.arcgis.com/datasets/8c9528d3e1824db3b14ed53188a46291_0.geojson")%>%
st_transform('ESRI:102658')
MiamiProperties<-
MiamiProperties %>%
mutate(
parks_nn1= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(Parks),1),
parks_nn2= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(Parks),2),
parks_nn3= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(Parks),3),
parks_nn4= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(Parks),4),
parks_nn5= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(Parks),5))
MiamiProperties$parks.Buffer =
st_buffer(MiamiProperties, 660) %>%
aggregate(mutate(dplyr::select(Parks), counter = 1),., sum) %>%
pull(counter)
MiamiProperties<-
MiamiProperties%>%
mutate(parks.Buffer = replace_na(parks.Buffer, 0))
#TOD or non-TOD; distance to transit stop?
metrorail_stop<-st_read("https://opendata.arcgis.com/datasets/ee3e2c45427e4c85b751d8ad57dd7b16_0.geojson")%>%
st_transform('ESRI:102658')%>%
dplyr::select(NAME)
metromover_stop<-st_read("https://opendata.arcgis.com/datasets/aec76104165c4e879b9b0203fa436dab_0.geojson")%>%
st_transform('ESRI:102658')%>%
dplyr::select(NAME)
metro_stops<-
rbind(metromover_stop,metrorail_stop)
#Distance to metro_stop (Added a column for the distance to the nearest stop and a column for homes that are within 0.5 miles of a stop)
MiamiProperties<-
MiamiProperties %>%
mutate(
metro_nn1= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(metro_stops),1),
Halfmile_metro=ifelse(metro_nn1<2640,"Halfmile_metro","Non_Halfmile_metro"))Adding Neighborhood and Elementary School Names to Each Home Update this to include the correct centroid function
#Add Elementary School Name to Each Property
elementary.school.clean<-
elementary.school.boundaries%>%
dplyr::select(NAME)%>%
rename(elem_name=NAME)
MiamiProperties <- st_join(st_centroid(MiamiProperties), elementary.school.clean, left = TRUE)
#Add neighborhood name to each property
MiamiProperties <- st_join(st_centroid(MiamiProperties), all_nhoods, left = TRUE)
#Add variable for Miami or Miami Beach
MiamiProperties<-
MiamiProperties%>%
mutate(MiamiBeach=ifelse(Property.City=="Miami Beach","Yes","No"))Loading Census Data
## Load in census data
census_api_key("41e1c0d912341017fa6f36a5da061d3b23de335e", overwrite = TRUE)
selected_vars <- c("B02001_001E", # Estimate!!Total population by race -- ##let's double check that it's okay to use this as long as we justify it
"B02001_002E", # People describing themselves as "white alone"
"B02001_003E", # People describing themselves as "black" or "african-american" alone
"B15001_050E", # Females with bachelors degrees
"B15001_009E", # Males with bachelors degrees
"B19013_001E", # Median HH income
"B25058_001E", # Median rent
"B06012_002E", # Total poverty
"B08301_001E", # People who have means of transportation to work
"B08301_002E", # Total people who commute by car, truck, or van
"B08301_010E", # Total people who commute by public transportation"
"B03002_012E", # Estimate Total Hispanic or Latino by race
"B19326_001E", # Median income in past 12 months (inflation-adjusted)
"B07013_001E", # Total households
"B07013_002E", # Total owner-occupied households
"B07013_003E", # total renter-occupied households
"B25027_001E",
"B25027_010E",
"B25038_002E",
"B25038_003E",
"B25038_004E",
"B25038_005E",
"B19001_017E")
tracts18 <-
get_acs(geography = "tract",
variables = selected_vars,
year=2018,
state=12,
county = 086,
geometry=T,
output="wide") %>%
st_transform('ESRI:102658') %>%
rename(TotalPop = B02001_001E,
Whites = B02001_002E,
Blacks = B02001_003E,
FemaleBachelors = B15001_050E,
MaleBachelors = B15001_009E,
MedHHInc = B19013_001E,
MedRent = B25058_001E,
TotalPoverty = B06012_002E,
TotalCommute = B08301_001E,
CarCommute = B08301_002E,
PubCommute = B08301_010E,
TotalHispanic = B03002_012E,
MedInc = B19326_001E,
TotalHH = B07013_001E,
OwnerHH = B07013_002E,
RenterHH = B07013_003E,
#TotalHH2 = B25027_001E,
HHNoMort = B25027_010E,
Own2017later = B25038_003E,
Own201516 = B25038_004E,
Own201014 = B25038_005E,
HH200k = B19001_017E) %>%
dplyr::select(-NAME, -starts_with("B0"), -starts_with("B1"), -starts_with("B2")) %>%
mutate(pctWhite = (ifelse(TotalPop > 0, Whites / TotalPop,0))*100,
pctBlack = (ifelse(TotalPop > 0, Blacks / TotalPop,0))*100,
pctHis = (ifelse(TotalPop >0, TotalHispanic/TotalPop,0))*100,
pctBlackorHis = (ifelse (TotalPop>0, (Blacks+TotalHispanic)/TotalPop,0)) *100,
pctBachelors = (ifelse(TotalPop > 0, ((FemaleBachelors + MaleBachelors) / TotalPop),0)) *100,
pctPoverty = (ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0))*100,
pctCarCommute = (ifelse(TotalCommute > 0, CarCommute / TotalCommute,0))*100,
pctPubCommute = (ifelse(TotalCommute > 0, PubCommute / TotalCommute,0))*100,
pctOwnerHH = (ifelse(TotalHH > 0, OwnerHH / TotalHH,0))*100,
pctRenterHH = (ifelse(TotalHH > 0, RenterHH / TotalHH,0))*100,
pctNoMortgage = (ifelse(TotalHH > 0, HHNoMort / TotalHH,0))*100,
pctOwnerSince2010 = (ifelse(OwnerHH > 0, ((Own2017later + Own201516 + Own201014) / OwnerHH),0)) *100,
pctHH200kOrMore = (ifelse(TotalHH > 0, (HH200k/ TotalHH),0))*100,
year = "2018") %>%
mutate(MedHHInc = replace_na(MedHHInc, 0),
MedRent = replace_na(MedRent,0),
pctBachelors= replace_na(pctBachelors,0),
pctHis= replace_na(pctHis,0),
pctOwnerHH= replace_na(pctOwnerHH,0),
pctCarCommute= replace_na(pctCarCommute,0),
pctHH200kOrMore= replace_na(pctHH200kOrMore,0)) %>%
dplyr::select(-Whites, -Blacks, -FemaleBachelors, -MaleBachelors, -TotalPoverty, -CarCommute, -PubCommute, -TotalCommute, -TotalHispanic)
#merge the data into the MiamiProperties dataframe
MiamiProperties <-st_join((st_centroid(MiamiProperties)),tracts18, left =TRUE) %>%
mutate(MedHHInc = replace_na(MedHHInc, 0),
MedRent = replace_na(MedRent,0),
pctBachelors= replace_na(pctBachelors,0),
pctHis= replace_na(pctHis,0),
pctOwnerHH= replace_na(pctOwnerHH,0),
pctCarCommute= replace_na(pctCarCommute,0),
pctHH200kOrMore= replace_na(pctHH200kOrMore,0))Completing Feature Engineering and Filtering Out Homes Where Sales Prices Have Been Removed
## create dataframe of homes to predict
MiamiPropertiesPred<-
MiamiProperties %>%
filter(toPredict == 1)
## create dataframe of rest of the homes
MiamiProperties<-
MiamiProperties%>%
filter(toPredict == 0)Present a table of summary statistics with variable descriptions. Sort these variables by their category (internal characteristics, amenities/public services or spatial structure). Check out the stargazer package for this.
This is a table of all the engineered features that we considered including in the model.
engin_vars_table <-data.frame(
Variable = c("CoastDist", "milecoast", "bars_nn1 ~ 5", "bars_Buffer", "crime_nn1 ~ 5","sexualoffenders_Buffer", "park_nn1 ~ 5", "parks.Buffer" , "metro_nn1", "Halfmile_metro","pctWhite", "pctBlack", "pctHis", "pctBlackorHis", "pctBachelors","pctPoverty", "pctCarCommute", "pctPubCommute", "pctOwnerHH", "pctRenterHH", "pctNoMortgage", "pctOwnerSince2010", "pctHH200kOrMore", "pool", "singlefamily", "estates", "dock", "Age", "luxury", "elevator", "BeachView"),
Dataset = c("OpenStreetMap", "OpenStreetMap", "OpenStreetMap", "OpenStreetMap", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "Provide Miami Housing Data", "Provide Miami Housing Data", "Provide Miami Housing Data", "Provide Miami Housing Data", "Provide Miami Housing Data", "Provide Miami Housing Data", "Provide Miami Housing Data", "Provide Miami Housing Data"),
Description = c("Distance to coast", "Units within a mile from the coast", "Average distances of 1-5 bars", "Number of bars within half mile from a housing units", "Average distances of 1-5 registered sexual offenders", "Number of sexual offenders within half mile from a housing unit", "Average distances of 1-5 parks", "Number of parks within half mile from the unit", "Distance to closest metrorail station", "Units within half mile from a metrorail stations", "Percentage of White population", "Percentage of Black population", "Percentage of Hispanic populaiton", "Percentage of Black and Hispanic population", "Percentage of population with at least Bachelor's degree", "Percentage of population below the poverty line", "Percentage of population communiting by car", "Percentage of population commuting by public transit", "Percentage of households in owner-occupied units", "Percentage of households in renter-occupied units", "Percentage of households with no mortgage", "Percentage of households who owned the house since 2010 or later", "Percentage of households with income of $200K or more", "Units with a pool", "Single Family zoning units", "Estates Zoning Units", "Units with a dock", "Age of the units", "Units with luxuary amenitiss", "Units with an elevator", "Units with a beach view"),
"Data type" = c("continuous", "categorical", "continuous", "continuous", "continuous", "continuous", "continuous", "continuous", "continuous", "categorical", "continuous", "continuous", "continuous", "continuous", "continuous", "continuous", "continuous","continuous", "continuous", "continuous", "continuous", "continuous", "continous", "categorical", "categorical", "categorical", "categorical", "continous", "categorical", "categorical", "categorical"),
"Final model" = c("No", "No", "No", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No")
)
kbl(engin_vars_table) %>%
kable_paper(c("striped", "hover"), html_font = "Montserrat", full_width = F) %>%
kable_material() %>%
kable_styling (bootstrap_options = "striped", "condensed", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "dodgerblue", font_size = 13)%>%
column_spec(1, bold = T) %>%
column_spec(2, width = "25%") %>%
column_spec(3, width = "40%")| Variable | Dataset | Description | Data.type | Final.model |
|---|---|---|---|---|
| CoastDist | OpenStreetMap | Distance to coast | continuous | No |
| milecoast | OpenStreetMap | Units within a mile from the coast | categorical | No |
| bars_nn1 ~ 5 | OpenStreetMap | Average distances of 1-5 bars | continuous | No |
| bars_Buffer | OpenStreetMap | Number of bars within half mile from a housing units | continuous | No |
| crime_nn1 ~ 5 | Miami-Dade County Open Data Hub | Average distances of 1-5 registered sexual offenders | continuous | No |
| sexualoffenders_Buffer | Miami-Dade County Open Data Hub | Number of sexual offenders within half mile from a housing unit | continuous | Yes |
| park_nn1 ~ 5 | Miami-Dade County Open Data Hub | Average distances of 1-5 parks | continuous | Yes |
| parks.Buffer | Miami-Dade County Open Data Hub | Number of parks within half mile from the unit | continuous | No |
| metro_nn1 | Miami-Dade County Open Data Hub | Distance to closest metrorail station | continuous | No |
| Halfmile_metro | Miami-Dade County Open Data Hub | Units within half mile from a metrorail stations | categorical | Yes |
| pctWhite | Miami-Dade County Open Data Hub | Percentage of White population | continuous | No |
| pctBlack | U.S. Census Bureau | Percentage of Black population | continuous | No |
| pctHis | U.S. Census Bureau | Percentage of Hispanic populaiton | continuous | Yes |
| pctBlackorHis | U.S. Census Bureau | Percentage of Black and Hispanic population | continuous | No |
| pctBachelors | U.S. Census Bureau | Percentage of population with at least Bachelor’s degree | continuous | Yes |
| pctPoverty | U.S. Census Bureau | Percentage of population below the poverty line | continuous | No |
| pctCarCommute | U.S. Census Bureau | Percentage of population communiting by car | continuous | Yes |
| pctPubCommute | U.S. Census Bureau | Percentage of population commuting by public transit | continuous | No |
| pctOwnerHH | U.S. Census Bureau | Percentage of households in owner-occupied units | continuous | Yes |
| pctRenterHH | U.S. Census Bureau | Percentage of households in renter-occupied units | continuous | No |
| pctNoMortgage | U.S. Census Bureau | Percentage of households with no mortgage | continuous | No |
| pctOwnerSince2010 | U.S. Census Bureau | Percentage of households who owned the house since 2010 or later | continuous | No |
| pctHH200kOrMore | U.S. Census Bureau | Percentage of households with income of $200K or more | continous | Yes |
| pool | Provide Miami Housing Data | Units with a pool | categorical | Yes |
| singlefamily | Provide Miami Housing Data | Single Family zoning units | categorical | Yes |
| estates | Provide Miami Housing Data | Estates Zoning Units | categorical | Yes |
| dock | Provide Miami Housing Data | Units with a dock | categorical | Yes |
| Age | Provide Miami Housing Data | Age of the units | continous | Yes |
| luxury | Provide Miami Housing Data | Units with luxuary amenitiss | categorical | Yes |
| elevator | Provide Miami Housing Data | Units with an elevator | categorical | Yes |
| BeachView | Provide Miami Housing Data | Units with a beach view | categorical | No |
As part of the exploratory analysis, we examined correlation to assist in identifying features that may be useful for predicting sales price. A correlation matrix helps us visualize correlation across numeric variables. In the figures, the darker colors imply stronger correlation. These plot helps to determine features that are correlated to sale prices (see sale prices row) and variables that are correlated with each other. The first figure depicts a correlation plot of variables derived from the U.S. Census Bureau. The second figure depicts the correlation between the numeric features included in the final model. We developed several correlation matrices with additional variables in order to help determine features to include in the final model.
#Correlation of Census Variables
<<<<<<< HEAD
censusVars <-
select(st_drop_geometry(MiamiProperties), SalePrice, MedHHInc, MedRent, pctWhite, pctBlack, pctHis, pctBlackorHis,
pctBachelors, pctPoverty, pctCarCommute, pctPubCommute, pctOwnerHH, pctRenterHH, pctNoMortgage, pctOwnerSince2010, pctHH200kOrMore) %>% na.omit()
ggcorrplot(
round(cor(censusVars), 1),
p.mat = cor_pmat(censusVars),
colors = c("#25CB10", "white", "#FA7800"),
type="lower",
insig = "blank") +
labs(title = "Correlation Across Census Variables",
caption="Correlation Plot of Census Variables") #Correlation of Variables Included in Model
reg_final_vars <-
select(st_drop_geometry(MiamiProperties), SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, parks_nn1, MedHHInc, pctHis, pctBachelors, pctOwnerHH, pctCarCommute, pctHH200kOrMore, sexualoffenders_Buffer) %>% na.omit()
ggcorrplot(
round(cor(reg_final_vars), 1),
p.mat = cor_pmat(reg_final_vars),
colors = c("#25CB10", "white", "#FA7800"),
type="lower",
insig = "blank") +
labs(title = "Correlation Across Final Numeric Features",
<<<<<<< HEAD
caption="Correlation Plot of All Numeric Features Included in Final Model") The charts below plot sales price as a function of numeric features. The first figure is an example of a diagnostic tool used to select features for inclusion in the final model. The series of plots show the relationship between sale price and 6 crime variables for sexual offender and predators within the region (5 nearest neighbor variables and one buffer variable that counts the number of offenses within 1/8 mile of each home). By reviewing the slopes of the plotted lines, we selected the crime variable that was most highly correlated with Sale Prices to include in the final model.
The second figure depicts four correlation scatterplots between continuous features and sale prices that suggest correlation. In all cases, the regression line slopes upward from left to right, meaning that on average, as the variable of interest (e.g., home size, median household income, distance to the nearest park, and household income greater than $200,000) increases, so does price. Correlation can also be described by the slope of the line. The greater the slope, the greater the feature’s effect on price.
======= caption="Figure 2. Correlation Plot of All Numeric Features Included in Final Model")The charts below plot sales price as a function of numeric features. Figure 3 is an example of a diagnostic tool used to select features for inclusion in the final model. The series of plots show the relationship between sale price and 6 crime variables for sexual offender and predators within the region (5 nearest neighbor variables and one buffer variable that counts the number of offenses within 1/8 mile of each home). By reviewing the slopes of the plotted lines, we selected the crime variable that was most highly correlated with Sale Prices to include in the final model.
Figure 4 depicts four correlation scatterplots between continuous features and sale prices that suggest correlation. In all cases, the regression line slopes upward from left to right, meaning that on average, as the variable of interest (e.g., home size, median household income, distance to the nearest park, and household income greater than $200,000) increases, so does price. Correlation can also be described by the slope of the line. The greater the slope, the greater the feature’s effect on price.
>>>>>>> 30a48337056764284380a80a0436d5403662727e#Correlation Scatterplots for all NN variable for Crime, as an example:
st_drop_geometry(MiamiProperties) %>%
dplyr::select(SalePrice, crime_nn1, crime_nn2, crime_nn3, crime_nn4, crime_nn5, sexualoffenders_Buffer) %>%
gather(Variable, Value, -SalePrice) %>%
ggplot(aes(Value, SalePrice)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Correlation between Sale Price and Crime Features",
caption="Correlation Scatterplot Depicting the Relationship Between Sale Price and Crime Features") +
plotTheme()+
theme(strip.text.x = element_text(size = 10))#Correlation Scatterplots for Variables of Interest
st_drop_geometry(MiamiProperties) %>%
mutate(Age = 2020 - EffectiveYearBuilt) %>%
dplyr::select(SalePrice, LivingSqFt, parks_nn1, MedHHInc, pctHH200kOrMore) %>%
#filter(SalePrice <= 1000000, Age < 500) %>%
gather(Variable, Value, -SalePrice) %>%
ggplot(aes(Value, SalePrice)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Price as a Function of Continuous Variables",
caption="Correlation between Selected Continuous Variables and Sale Price") +
plotTheme()+
theme(strip.text.x = element_text(size = 10))Some of the features included in the analysis are categorical rather than numeric (see Figure 5). Instead of using slope to calculate correlation, we evaluate the difference in mean price across each category. Differences in mean sales prices across categories suggest that the variable may be a good predictor of Sale Prices.
st_drop_geometry(MiamiProperties) %>%
dplyr::select(SalePrice, pool, elevator, luxury)%>%
gather(Variable,Value, -SalePrice)%>%
ggplot(aes(Value, SalePrice))+
geom_bar(position="dodge",stat="summary", fun.y="mean")+
facet_wrap(~Variable, ncol=3, scales="free")+
labs(title = "Price as a Function of Categorical Variables",
caption="Correlation between Selected Categorical Variables and Average Sale Price") +
plotTheme()The figure below depicts the spatial distribution of sale prices (or price per square foot) across Miami and Miami Beach. The map suggests that prices are clustered within the city, rather than randomly distributed. In other words, homes with a greater price per square foot (orange color) are grouped together (e.g., on the eastern coast of the mainland and within Miami Beach) and homes with a lower price per square foot (green color) are grouped together father inland.
=======Figure 6 depicts the spatial distribution of sale prices (or price per square foot) across Miami and Miami Beach. The map suggests that prices are clustered within the city, rather than randomly distributed. In other words, homes with a greater price per square foot (orange color) are grouped together (e.g., on the eastern coast of the mainland and within Miami Beach) and homes with a lower price per square foot (green color) are grouped together father inland.
>>>>>>> 30a48337056764284380a80a0436d5403662727eMiami.Plot<-
MiamiProperties%>%
mutate(PricePerSq=SalePrice/ActualSqFt)
ggplot()+
geom_sf(data=all_nhoods_MB, fill="grey40")+
geom_sf(data=Miami.Plot, aes(colour=q5(PricePerSq)),
show.legend="point", size=.75)+
scale_colour_manual(values=palette5,
labels=qBr(Miami.Plot, "PricePerSq"),
name="Quintile\nBreaks")+
labs(title="Price Per Square Foot, Miami",
caption="Price per Square Foot for Homes in Miami and Miami Beach") +
mapTheme()The maps below depict independent variables included in the final model. The first map shows the density of sexual offenders and predators in Miami and Miami Beach. The locations with the greatest density of sexual offenders and predators in the northern portions of Miami. The second map shows the locations of single-family and multi-family homes. Many single family homes are clustered along the shorline, with multi-family homes largely located further inland. The third maps shows the locations of County, Municipal, State, and Federal Park Facilities in the area. Parks are fairly evenly distributed throughout the area, with a slightly lesser density of parks further inland.
sexual_offend_map<-
miami.sexualoffenders%>%
st_intersection(all_nhoods)
Parks_map<-
Parks%>%
st_intersection(all_nhoods_MB)
grid.arrange(ncol = 2,
ggplot() + geom_sf(data = all_nhoods_MB, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(sexual_offend_map)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title = "Locations of Sexual Offenders and Predators",
caption="Density of Sexual Offenders in Miami and Miami Beach") +
mapTheme(),
ggplot()+
geom_sf(data=all_nhoods_MB, fill="grey40")+
geom_sf(data=Miami.Plot, aes(colour=singlefamily),
show.legend="point", size=.75)+
labs(title="Single vs. Multi Family Homes",
caption="Figure 9. Single and Multi Family Homes in Miami and Miami Beach") +
mapTheme())ggplot()+
geom_sf(data=all_nhoods_MB, fill="grey40")+
geom_sf(data=Parks_map, color="lightgreen", show.legend="point", size=0.75)+
labs(title="Park Locations",
catption="Locations of Park Facilities in Miami and Miami Beach") +
mapTheme()The maps below depict independent variables included in the final model. Figure 7 depicts the density of sexual offenders and predators in Miami and Miami Beach. The locations with the greatest density of sexual offenders and predators in the northern portions of Miami. Figure 8 shows the locations of County, Municipal, State, and Federal Park Facilities in the area. Parks are fairly evenly distributed throughout the area, with a slightly lesser density of parks further inland. Figure 9 shows the locations of single-family and multi-family homes. Many single family homes are clustered along the shorline, with multi-family homes largely located further inland.
#density of sexual offenders
sexual_offend_map<-
miami.sexualoffenders%>%
st_intersection(all_nhoods)
ggplot() + geom_sf(data = all_nhoods_MB, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(sexual_offend_map)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title = "Locations of Sexual Offenders and Predators",
caption="Figure 7. Density of Sexual Offenders in Miami and Miami Beach") +
mapTheme()#Parks
Parks_map<-
Parks%>%
st_intersection(all_nhoods_MB)
ggplot()+
geom_sf(data=all_nhoods_MB, fill="grey40")+
geom_sf(data=Parks_map, size=1, color="lightgreen")+
labs(title="Park Locations",
catption="Figure 8. Locations of Park Facilities in Miami and Miami Beach") +
mapTheme()#Single Family Homes vs. Multi Family Homes
ggplot()+
geom_sf(data=all_nhoods_MB, fill="grey40")+
geom_sf(data=Miami.Plot, aes(colour=singlefamily),
show.legend="point", size=.75)+
labs(title="Single vs. Multi Family Homes",
caption="Figure 9. Single and Multi Family Homes in Miami and Miami Beach") +
mapTheme()After exploring and testing correlations of all the variables, we have started our model building process by looking at only at physical attributes of the housing units (reg1). Then we added additional engineered variables to see how the model performs (reg2). In reg3, we included almost all variables, even the ones we have determined to have strong collinearity, just to see how the model with too many variables that are collinear would perform.
<<<<<<< HEADAfter our initial exploratory models, we used the stepwise backward method to find the statistically best model. The stepwise backward function drops each individual variable at a time to find a model with the least AIC (Akaike information criterion) which estimates the quality of each model relative to other models. the output of this process, reg4 was the model with the smallest AIC value, indicating this output is the statistically best model.
The variables with crime factors were not part of the output from the stepwise backward function. However, we made a judgment call that crime (proximity or density of sexual offenders as a proxy) is important factor in the predicting the price of a home. So we decided to make a model (reg5) that includes “crime_nn5”, which indicates the average distances of the 5 nearest sexual offenders, and another model (reg6) that includes “sexualoffenders_Buffer”, which indicates the number of sexual offenders in a half mile buffer of a home.
=======## model with physical attributes of the housing
reg1 <- lm(SalePrice ~ ., data = st_drop_geometry(MiamiProperties) %>%
dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, elevator, dock, BeachView, pool, luxury, singlefamily, estates))
summary(reg1) %>%
xtable() %>%
kable()| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 504740.78752 | 112369.359446 | 4.4918009 | 0.0000074 |
| LotSize | 48.00983 | 5.175078 | 9.2771209 | 0.0000000 |
| Bed | -177431.45740 | 26364.889763 | -6.7298388 | 0.0000000 |
| Bath | 130667.03461 | 27776.336931 | 4.7042573 | 0.0000027 |
| Age | -1692.83970 | 663.550540 | -2.5511843 | 0.0107923 |
| ActualSqFt | 601.23373 | 21.615105 | 27.8154439 | 0.0000000 |
| elevatorYes | 1459061.48261 | 177745.166004 | 8.2087267 | 0.0000000 |
| dockNo Dock | -938449.77490 | 90502.206638 | -10.3693579 | 0.0000000 |
| BeachViewYes | -87089.15501 | 151595.424267 | -0.5744841 | 0.5656897 |
| poolPool | 1660.53867 | 47763.736934 | 0.0347657 | 0.9722692 |
| luxuryYes | 3943743.59377 | 176617.755093 | 22.3292590 | 0.0000000 |
| singlefamilyYes | -243110.73077 | 34594.601127 | -7.0274182 | 0.0000000 |
| estatesYes | 3432949.24232 | 203207.417448 | 16.8938186 | 0.0000000 |
## add addtional engineered variables
reg2 <- lm(SalePrice ~ ., data = st_drop_geometry(MiamiProperties) %>%
dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt,
parks_nn1, MedRent, MedHHInc, pctBlackorHis, pctBachelors, pool,
pctOwnerHH, crime_nn5, milecoast, pctNoMortgage, pctOwnerSince2010, pctCarCommute, pctPubCommute,
pctPoverty, Halfmile_metro, CoastDist, pctHH200kOrMore, elevator, dock, luxury, BeachView))
summary(reg2)##
## Call:
## lm(formula = SalePrice ~ ., data = st_drop_geometry(MiamiProperties) %>%
## dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt,
## parks_nn1, MedRent, MedHHInc, pctBlackorHis, pctBachelors,
## pool, pctOwnerHH, crime_nn5, milecoast, pctNoMortgage,
## pctOwnerSince2010, pctCarCommute, pctPubCommute, pctPoverty,
## Halfmile_metro, CoastDist, pctHH200kOrMore, elevator,
## dock, luxury, BeachView))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6150950 -243991 23992 269054 11668506
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1732717.954 404062.943 4.288
## LotSize 82.986 5.127 16.187
## Bed -200237.241 27312.324 -7.331
## Bath 121706.869 28997.633 4.197
## Age -1544.741 685.881 -2.252
## ActualSqFt 642.300 22.281 28.827
## parks_nn1 51.400 21.364 2.406
## MedRent 493.947 52.471 9.414
## MedHHInc -7.827 1.451 -5.394
## pctBlackorHis 1473.638 3050.742 0.483
## pctBachelors -137797.338 34187.925 -4.031
## poolPool -279862.138 52041.332 -5.378
## pctOwnerHH 4370.392 2420.360 1.806
## crime_nn5 -31.101 14.112 -2.204
## milecoastYes -45641.078 64424.121 -0.708
## pctNoMortgage -2867.908 9708.868 -0.295
## pctOwnerSince2010 -5777.683 4458.552 -1.296
## pctCarCommute -23557.440 3998.039 -5.892
## pctPubCommute -11823.922 6115.165 -1.934
## pctPoverty 6090.940 2810.809 2.167
## Halfmile_metroNon_Halfmile_metro -42914.658 55350.634 -0.775
## CoastDist 8803.453 25191.572 0.349
## pctHH200kOrMore 54814.177 13328.198 4.113
## elevatorYes 1976762.846 184267.710 10.728
## dockNo Dock -1024027.443 94386.813 -10.849
## luxuryYes 3914685.887 181844.548 21.528
## BeachViewYes 196961.711 172970.845 1.139
## Pr(>|t|)
## (Intercept) 0.000018666680324 ***
## LotSize < 0.0000000000000002 ***
## Bed 0.000000000000303 ***
## Bath 0.000027939447231 ***
## Age 0.0244 *
## ActualSqFt < 0.0000000000000002 ***
## parks_nn1 0.0162 *
## MedRent < 0.0000000000000002 ***
## MedHHInc 0.000000075011910 ***
## pctBlackorHis 0.6291
## pctBachelors 0.000057237164173 ***
## poolPool 0.000000082166985 ***
## pctOwnerHH 0.0711 .
## crime_nn5 0.0276 *
## milecoastYes 0.4787
## pctNoMortgage 0.7677
## pctOwnerSince2010 0.1951
## pctCarCommute 0.000000004301860 ***
## pctPubCommute 0.0533 .
## pctPoverty 0.0303 *
## Halfmile_metroNon_Halfmile_metro 0.4382
## CoastDist 0.7268
## pctHH200kOrMore 0.000040329335504 ***
## elevatorYes < 0.0000000000000002 ***
## dockNo Dock < 0.0000000000000002 ***
## luxuryYes < 0.0000000000000002 ***
## BeachViewYes 0.2549
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 829000 on 2599 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8369, Adjusted R-squared: 0.8352
## F-statistic: 512.8 on 26 and 2599 DF, p-value: < 0.00000000000000022
## include almost all variables
reg3 <- lm(SalePrice ~ ., data = st_drop_geometry(MiamiProperties) %>%
dplyr::select(SalePrice, LotSize, Bed, Bath, Age,
ActualSqFt, parks_nn1, MedRent, MedHHInc, pctWhite, pctBlack, pctHis, pctBlackorHis, pctBachelors, pool,
singlefamily, pctOwnerHH, pctRenterHH, crime_nn5, milecoast, pctNoMortgage, pctOwnerSince2010, pctCarCommute,
pctPubCommute, pctPoverty, Halfmile_metro, metro_nn1, sexualoffenders_Buffer, milecoast, CoastDist, pctHH200kOrMore))
summary(reg3)##
## Call:
## lm(formula = SalePrice ~ ., data = st_drop_geometry(MiamiProperties) %>%
## dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt,
## parks_nn1, MedRent, MedHHInc, pctWhite, pctBlack, pctHis,
## pctBlackorHis, pctBachelors, pool, singlefamily, pctOwnerHH,
## pctRenterHH, crime_nn5, milecoast, pctNoMortgage, pctOwnerSince2010,
## pctCarCommute, pctPubCommute, pctPoverty, Halfmile_metro,
## metro_nn1, sexualoffenders_Buffer, milecoast, CoastDist,
## pctHH200kOrMore))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4169482 -349561 26271 363778 13967127
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) -1199084.087 598149.608 -2.005
## LotSize 69.292 5.599 12.377
## Bed -311230.112 29612.849 -10.510
## Bath 149003.993 31812.026 4.684
## Age -3078.212 757.994 -4.061
## ActualSqFt 907.582 20.603 44.051
## parks_nn1 64.165 24.291 2.641
## MedRent 521.641 57.422 9.084
## MedHHInc -9.410 1.528 -6.158
## pctWhite 12467.684 4065.926 3.066
## pctBlack 22126.357 6677.178 3.314
## pctHis 8266.271 3931.581 2.103
## pctBlackorHis NA NA NA
## pctBachelors -139740.115 37923.627 -3.685
## poolPool -305848.900 57455.036 -5.323
## singlefamilyYes -548427.757 47211.072 -11.617
## pctOwnerHH 9791.028 2839.093 3.449
## pctRenterHH NA NA NA
## crime_nn5 3.317 16.844 0.197
## milecoastYes -61817.681 73425.550 -0.842
## pctNoMortgage 2464.264 10851.585 0.227
## pctOwnerSince2010 4609.115 5023.466 0.918
## pctCarCommute -23196.011 4590.258 -5.053
## pctPubCommute -7224.647 7335.817 -0.985
## pctPoverty 157.046 3425.889 0.046
## Halfmile_metroNon_Halfmile_metro -78563.665 66698.296 -1.178
## metro_nn1 11.786 4.101 2.874
## sexualoffenders_Buffer -20107.978 25869.854 -0.777
## CoastDist -28645.246 32654.659 -0.877
## pctHH200kOrMore 59847.631 15142.705 3.952
## Pr(>|t|)
## (Intercept) 0.045103 *
## LotSize < 0.0000000000000002 ***
## Bed < 0.0000000000000002 ***
## Bath 0.000002959711 ***
## Age 0.000050309473 ***
## ActualSqFt < 0.0000000000000002 ***
## parks_nn1 0.008304 **
## MedRent < 0.0000000000000002 ***
## MedHHInc 0.000000000851 ***
## pctWhite 0.002189 **
## pctBlack 0.000933 ***
## pctHis 0.035603 *
## pctBlackorHis NA
## pctBachelors 0.000234 ***
## poolPool 0.000000110632 ***
## singlefamilyYes < 0.0000000000000002 ***
## pctOwnerHH 0.000572 ***
## pctRenterHH NA
## crime_nn5 0.843888
## milecoastYes 0.399916
## pctNoMortgage 0.820373
## pctOwnerSince2010 0.358957
## pctCarCommute 0.000000464374 ***
## pctPubCommute 0.324792
## pctPoverty 0.963440
## Halfmile_metroNon_Halfmile_metro 0.238946
## metro_nn1 0.004085 **
## sexualoffenders_Buffer 0.437068
## CoastDist 0.380449
## pctHH200kOrMore 0.000079487259 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 910100 on 2598 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8035, Adjusted R-squared: 0.8014
## F-statistic: 393.4 on 27 and 2598 DF, p-value: < 0.00000000000000022
After our initial exploratory models, we used the stepwise backward method to find the statistically best model. The stepwise backward function drops each individual variable at a time to find a model with the least AIC (Akaike information criterion) which estimates the quality of each model relative to other models. the output of this process, reg4 was the model with the smallest AIC value, indicating this output is the statistically best model.
## Start: AIC=71250.08
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 +
## MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBlackorHis +
## pctBachelors + pool + singlefamily + pctOwnerHH + pctRenterHH +
## crime_nn5 + milecoast + pctNoMortgage + pctOwnerSince2010 +
## pctCarCommute + pctPubCommute + pctPoverty + Halfmile_metro +
## metro_nn1 + sexualoffenders_Buffer + CoastDist + pctHH200kOrMore +
## luxury + elevator + dock + estates
##
##
## Step: AIC=71250.08
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 +
## MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBlackorHis +
## pctBachelors + pool + singlefamily + pctOwnerHH + crime_nn5 +
## milecoast + pctNoMortgage + pctOwnerSince2010 + pctCarCommute +
## pctPubCommute + pctPoverty + Halfmile_metro + metro_nn1 +
## sexualoffenders_Buffer + CoastDist + pctHH200kOrMore + luxury +
## elevator + dock + estates
##
##
## Step: AIC=71250.08
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 +
## MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors +
## pool + singlefamily + pctOwnerHH + crime_nn5 + milecoast +
## pctNoMortgage + pctOwnerSince2010 + pctCarCommute + pctPubCommute +
## pctPoverty + Halfmile_metro + metro_nn1 + sexualoffenders_Buffer +
## CoastDist + pctHH200kOrMore + luxury + elevator + dock +
## estates
##
## Df Sum of Sq RSS AIC
## - milecoast 1 34976861590 1556818908245101 71248
## - Halfmile_metro 1 178773872824 1556962705256336 71248
## - pctOwnerSince2010 1 376855090991 1557160786474503 71249
## - pctNoMortgage 1 474802046491 1557258733430002 71249
## - CoastDist 1 762916633576 1557546848017088 71249
## - pctPoverty 1 877484362284 1557661415745796 71250
## <none> 1556783931383512 71250
## - sexualoffenders_Buffer 1 1407645809325 1558191577192837 71250
## - crime_nn5 1 1410989790866 1558194921174378 71250
## - pctPubCommute 1 1809413695690 1558593345079202 71251
## - metro_nn1 1 3312886742800 1560096818126312 71254
## - pool 1 3958649099013 1560742580482525 71255
## - parks_nn1 1 4279260338173 1561063191721685 71255
## - Age 1 4347781940920 1561131713324432 71255
## - pctOwnerHH 1 4428324761288 1561212256144800 71256
## - pctBachelors 1 4535874751690 1561319806135202 71256
## - pctHis 1 5260656535142 1562044587918653 71257
## - pctWhite 1 5674102784626 1562458034168137 71258
## - Bath 1 8711401630863 1565495333014374 71263
## - pctBlack 1 8727987892154 1565511919275666 71263
## - pctCarCommute 1 15887481361670 1572671412745182 71275
## - MedHHInc 1 19151119776026 1575935051159538 71280
## - pctHH200kOrMore 1 20512870439495 1577296801823007 71282
## - Bed 1 22434693350667 1579218624734178 71286
## - singlefamily 1 31992502942841 1588776434326353 71302
## - MedRent 1 52232829388834 1609016760772346 71335
## - elevator 1 54925294938212 1611709226321724 71339
## - LotSize 1 64988335959589 1621772267343100 71355
## - dock 1 68821456703640 1625605388087152 71362
## - estates 1 160065650805404 1716849582188916 71505
## - luxury 1 312202401329569 1868986332713081 71728
## - ActualSqFt 1 421556392811348 1978340324194859 71877
##
## Step: AIC=71248.14
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 +
## MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors +
## pool + singlefamily + pctOwnerHH + crime_nn5 + pctNoMortgage +
## pctOwnerSince2010 + pctCarCommute + pctPubCommute + pctPoverty +
## Halfmile_metro + metro_nn1 + sexualoffenders_Buffer + CoastDist +
## pctHH200kOrMore + luxury + elevator + dock + estates
##
## Df Sum of Sq RSS AIC
## - Halfmile_metro 1 171438575196 1556990346820297 71246
## - pctOwnerSince2010 1 352904557827 1557171812802928 71247
## - pctNoMortgage 1 502137747760 1557321045992861 71247
## - pctPoverty 1 896137654646 1557715045899748 71248
## - CoastDist 1 1047793082086 1557866701327187 71248
## <none> 1556818908245101 71248
## - sexualoffenders_Buffer 1 1391887395928 1558210795641029 71248
## - crime_nn5 1 1461460944292 1558280369189394 71249
## - pctPubCommute 1 1845581829534 1558664490074635 71249
## - metro_nn1 1 3299923530050 1560118831775151 71252
## - pool 1 3946651824479 1560765560069580 71253
## - Age 1 4364251192178 1561183159437280 71253
## - parks_nn1 1 4367011227828 1561185919472930 71253
## - pctOwnerHH 1 4442065582406 1561260973827508 71254
## - pctBachelors 1 4521012012513 1561339920257614 71254
## - pctHis 1 5236458664406 1562055366909508 71255
## - pctWhite 1 5696837558700 1562515745803801 71256
## - Bath 1 8710284937892 1565529193182993 71261
## - pctBlack 1 8711864841164 1565530773086265 71261
## - pctCarCommute 1 15910805526417 1572729713771518 71273
## - MedHHInc 1 19460549195774 1576279457440874 71279
## - pctHH200kOrMore 1 20568246267674 1577387154512774 71281
## - Bed 1 22424636843521 1579243545088622 71284
## - singlefamily 1 32126353187000 1588945261432101 71300
## - MedRent 1 52451061395588 1609269969640690 71333
## - elevator 1 54957559139929 1611776467385030 71337
## - LotSize 1 65109953684523 1621928861929624 71354
## - dock 1 68890017943056 1625708926188157 71360
## - estates 1 160174267546179 1716993175791280 71503
## - luxury 1 312311046910239 1869129955155340 71726
## - ActualSqFt 1 421525593454794 1978344501699895 71875
##
## Step: AIC=71246.43
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 +
## MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors +
## pool + singlefamily + pctOwnerHH + crime_nn5 + pctNoMortgage +
## pctOwnerSince2010 + pctCarCommute + pctPubCommute + pctPoverty +
## metro_nn1 + sexualoffenders_Buffer + CoastDist + pctHH200kOrMore +
## luxury + elevator + dock + estates
##
## Df Sum of Sq RSS AIC
## - pctOwnerSince2010 1 378764955540 1557369111775838 71245
## - pctNoMortgage 1 485598132482 1557475944952780 71245
## - pctPoverty 1 885909887388 1557876256707686 71246
## - CoastDist 1 969031534572 1557959378354869 71246
## <none> 1556990346820297 71246
## - sexualoffenders_Buffer 1 1424094846201 1558414441666498 71247
## - crime_nn5 1 1462837659448 1558453184479746 71247
## - pctPubCommute 1 1726608535822 1558716955356119 71247
## - metro_nn1 1 3242295720872 1560232642541170 71250
## - pool 1 3943045137196 1560933391957493 71251
## - Age 1 4262142912155 1561252489732452 71252
## - pctBachelors 1 4444216874477 1561434563694774 71252
## - pctOwnerHH 1 4517585713894 1561507932534191 71252
## - parks_nn1 1 4584285963053 1561574632783350 71252
## - pctHis 1 5066126328786 1562056473149083 71253
## - pctWhite 1 5525484579872 1562515831400169 71254
## - pctBlack 1 8633531626036 1565623878446333 71259
## - Bath 1 8720878730640 1565711225550937 71259
## - pctCarCommute 1 15744932373084 1572735279193382 71271
## - MedHHInc 1 19468953809268 1576459300629566 71277
## - pctHH200kOrMore 1 20423674200627 1577414021020924 71279
## - Bed 1 22335483388045 1579325830208342 71282
## - singlefamily 1 32423848819469 1589414195639766 71299
## - MedRent 1 52321860769394 1609312207589692 71331
## - elevator 1 54920309218976 1611910656039274 71335
## - LotSize 1 64942650068849 1621932996889146 71352
## - dock 1 69069471452323 1626059818272620 71358
## - estates 1 160502664435973 1717493011256270 71502
## - luxury 1 312574197736371 1869564544556668 71725
## - ActualSqFt 1 421468947430111 1978459294250408 71874
##
## Step: AIC=71245.07
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 +
## MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors +
## pool + singlefamily + pctOwnerHH + crime_nn5 + pctNoMortgage +
## pctCarCommute + pctPubCommute + pctPoverty + metro_nn1 +
## sexualoffenders_Buffer + CoastDist + pctHH200kOrMore + luxury +
## elevator + dock + estates
##
## Df Sum of Sq RSS AIC
## - pctPoverty 1 711226699202 1558080338475040 71244
## - CoastDist 1 801404496108 1558170516271946 71244
## - pctNoMortgage 1 840615219640 1558209726995478 71244
## <none> 1557369111775838 71245
## - sexualoffenders_Buffer 1 1395672314732 1558764784090569 71245
## - pctPubCommute 1 1448349259902 1558817461035739 71246
## - crime_nn5 1 1659792957243 1559028904733081 71246
## - metro_nn1 1 2925406784097 1560294518559934 71248
## - pool 1 4006745954864 1561375857730702 71250
## - pctBachelors 1 4065452999328 1561434564775166 71250
## - pctOwnerHH 1 4223036412996 1561592148188833 71250
## - Age 1 4305984334683 1561675096110520 71250
## - parks_nn1 1 4615673217101 1561984784992938 71251
## - pctHis 1 4722748347022 1562091860122860 71251
## - pctWhite 1 5231407232715 1562600519008552 71252
## - pctBlack 1 8254834884323 1565623946660160 71257
## - Bath 1 8774742378387 1566143854154224 71258
## - pctCarCommute 1 15381902470403 1572751014246241 71269
## - MedHHInc 1 19204811757066 1576573923532904 71275
## - pctHH200kOrMore 1 21537953892198 1578907065668036 71279
## - Bed 1 22538786390294 1579907898166131 71281
## - singlefamily 1 32047037325886 1589416149101724 71297
## - MedRent 1 52541564862424 1609910676638261 71330
## - elevator 1 55226329020642 1612595440796479 71335
## - LotSize 1 64911611882944 1622280723658781 71350
## - dock 1 69438058459428 1626807170235265 71358
## - estates 1 160191885102324 1717560996878162 71500
## - luxury 1 312651640114108 1870020751889945 71724
## - ActualSqFt 1 421944482172130 1979313593947967 71873
##
## Step: AIC=71244.27
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 +
## MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors +
## pool + singlefamily + pctOwnerHH + crime_nn5 + pctNoMortgage +
## pctCarCommute + pctPubCommute + metro_nn1 + sexualoffenders_Buffer +
## CoastDist + pctHH200kOrMore + luxury + elevator + dock +
## estates
##
## Df Sum of Sq RSS AIC
## - CoastDist 1 537174358025 1558617512833065 71243
## - pctNoMortgage 1 672704187898 1558753042662938 71243
## <none> 1558080338475040 71244
## - sexualoffenders_Buffer 1 1189166804381 1559269505279421 71244
## - pctPubCommute 1 1198337218813 1559278675693853 71244
## - crime_nn5 1 1637613748592 1559717952223632 71245
## - metro_nn1 1 2561379485218 1560641717960258 71247
## - pctOwnerHH 1 3840556113843 1561920894588883 71249
## - pool 1 3858128161825 1561938466636865 71249
## - Age 1 4255023330357 1562335361805397 71249
## - pctHis 1 4720073435264 1562800411910304 71250
## - pctBachelors 1 4843767871621 1562924106346661 71250
## - parks_nn1 1 5139305192697 1563219643667737 71251
## - pctWhite 1 5238910278604 1563319248753644 71251
## - pctBlack 1 8608217010056 1566688555485096 71257
## - Bath 1 8842567737520 1566922906212560 71257
## - pctCarCommute 1 15688324178424 1573768662653464 71269
## - MedHHInc 1 20098684779764 1578179023254804 71276
## - pctHH200kOrMore 1 22510542707686 1580590881182726 71280
## - Bed 1 22821004167456 1580901342642496 71280
## - singlefamily 1 32672633203812 1590752971678852 71297
## - MedRent 1 52053460084419 1610133798559459 71329
## - elevator 1 54955734053303 1613036072528343 71333
## - LotSize 1 65281712162320 1623362050637360 71350
## - dock 1 69007494420740 1627087832895780 71356
## - estates 1 159867325570649 1717947664045689 71499
## - luxury 1 312312089709836 1870392428184876 71722
## - ActualSqFt 1 423396975538696 1981477314013736 71874
##
## Step: AIC=71243.18
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 +
## MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors +
## pool + singlefamily + pctOwnerHH + crime_nn5 + pctNoMortgage +
## pctCarCommute + pctPubCommute + metro_nn1 + sexualoffenders_Buffer +
## pctHH200kOrMore + luxury + elevator + dock + estates
##
## Df Sum of Sq RSS AIC
## - pctNoMortgage 1 831913368312 1559449426201377 71243
## - pctPubCommute 1 1166005710260 1559783518543325 71243
## <none> 1558617512833065 71243
## - sexualoffenders_Buffer 1 1345449996428 1559962962829494 71243
## - crime_nn5 1 1686724412312 1560304237245377 71244
## - metro_nn1 1 2028394469436 1560645907302502 71245
## - pctOwnerHH 1 3304221325672 1561921734158737 71247
## - pool 1 3800932506688 1562418445339753 71248
## - Age 1 4194005638651 1562811518471716 71248
## - pctHis 1 4223563088938 1562841075922003 71248
## - pctWhite 1 4722087591294 1563339600424360 71249
## - parks_nn1 1 5012769143588 1563630281976654 71250
## - pctBachelors 1 5303478886028 1563920991719093 71250
## - pctBlack 1 8680645196341 1567298158029406 71256
## - Bath 1 8982099729198 1567599612562264 71256
## - pctCarCommute 1 15249361774748 1573866874607814 71267
## - MedHHInc 1 19675114444262 1578292627277328 71274
## - pctHH200kOrMore 1 21973399333651 1580590912166716 71278
## - Bed 1 23190533180545 1581808046013610 71280
## - singlefamily 1 32156628247600 1590774141080665 71295
## - MedRent 1 51706646403425 1610324159236490 71327
## - elevator 1 54755814703600 1613373327536665 71332
## - LotSize 1 65499920798491 1624117433631556 71349
## - dock 1 69516069599367 1628133582432432 71356
## - estates 1 159433161957169 1718050674790234 71497
## - luxury 1 312035675887201 1870653188720266 71720
## - ActualSqFt 1 425764199164783 1984381711997848 71875
##
## Step: AIC=71242.58
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 +
## MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors +
## pool + singlefamily + pctOwnerHH + crime_nn5 + pctCarCommute +
## pctPubCommute + metro_nn1 + sexualoffenders_Buffer + pctHH200kOrMore +
## luxury + elevator + dock + estates
##
## Df Sum of Sq RSS AIC
## <none> 1559449426201376 71243
## - sexualoffenders_Buffer 1 1253798137015 1560703224338392 71243
## - pctPubCommute 1 1365841858245 1560815268059622 71243
## - crime_nn5 1 1973636174701 1561423062376078 71244
## - metro_nn1 1 1979348929748 1561428775131124 71244
## - pctHis 1 3575637622773 1563025063824150 71247
## - pool 1 3711001497375 1563160427698752 71247
## - Age 1 4337709941065 1563787136142441 71248
## - pctWhite 1 4780517806188 1564229944007564 71249
## - parks_nn1 1 5424732183716 1564874158385092 71250
## - pctBachelors 1 7052364537540 1566501790738916 71252
## - pctBlack 1 8030104388766 1567479530590143 71254
## - pctOwnerHH 1 8236182872165 1567685609073542 71254
## - Bath 1 9177413308810 1568626839510186 71256
## - pctCarCommute 1 15864869875371 1575314296076748 71267
## - pctHH200kOrMore 1 22449536070059 1581898962271436 71278
## - Bed 1 23546167586048 1582995593787425 71280
## - MedHHInc 1 23673147512708 1583122573714084 71280
## - singlefamily 1 31324945385453 1590774371586830 71293
## - MedRent 1 50970218792512 1610419644993889 71325
## - elevator 1 54952098698386 1614401524899763 71332
## - LotSize 1 66453917166717 1625903343368093 71350
## - dock 1 69300151765953 1628749577967330 71355
## - estates 1 159019305847057 1718468732048433 71496
## - luxury 1 311803649136820 1871253075338196 71719
## - ActualSqFt 1 425711244001338 1985160670202715 71874
##
## Call:
## lm(formula = SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt +
## parks_nn1 + MedRent + MedHHInc + pctWhite + pctBlack + pctHis +
## pctBachelors + pool + singlefamily + pctOwnerHH + crime_nn5 +
## pctCarCommute + pctPubCommute + metro_nn1 + sexualoffenders_Buffer +
## pctHH200kOrMore + luxury + elevator + dock + estates, data = st_drop_geometry(MiamiProperties) %>%
## dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt,
## parks_nn1, MedRent, MedHHInc, pctWhite, pctBlack, pctHis,
## pctBlackorHis, pctBachelors, pool, singlefamily, pctOwnerHH,
## pctRenterHH, crime_nn5, milecoast, pctNoMortgage, pctOwnerSince2010,
## pctCarCommute, pctPubCommute, pctPoverty, Halfmile_metro,
## metro_nn1, sexualoffenders_Buffer, milecoast, CoastDist,
## pctHH200kOrMore, luxury, elevator, dock, estates))
##
## Coefficients:
## (Intercept) LotSize Bed
## 276011.040 53.187 -160779.522
## Bath Age ActualSqFt
## 105986.983 -1732.149 569.133
## parks_nn1 MedRent MedHHInc
## 61.026 433.723 -7.685
## pctWhite pctBlack pctHis
## 8704.458 16943.994 6924.188
## pctBachelors poolPool singlefamilyYes
## -99693.043 -123151.714 -289647.271
## pctOwnerHH crime_nn5 pctCarCommute
## 5972.884 -25.664 -19528.543
## pctPubCommute metro_nn1 sexualoffenders_Buffer
## -8940.123 5.056 -31486.780
## pctHH200kOrMore luxuryYes elevatorYes
## 75995.341 3887086.047 1654400.949
## dockNo Dock estatesYes
## -978048.563 3274924.331
The variables with crime factors were not part of the output from the stepwise backward function. However, we made a judgment call that crime (proximity or density of sexual offenders as a proxy) is important factor in the predicting the price of a home. So we decided to try adding “crime_nn5”, which indicates the average distances of the 5 nearest sexual offenders, and “sexualoffenders_Buffer”, which indicates the number of sexual offenders in a half mile buffer of a home.
>>>>>>> 30a48337056764284380a80a0436d5403662727e| Dependent variable: | ||||||
| SalePrice | ||||||
| (1) | (2) | (3) | (4) | (5) | (6) | |
| LotSize | 48.010*** | 82.986*** | 69.292*** | 53.669*** | 53.740*** | 47.937*** |
| (5.175) | (5.127) | (5.599) | (5.054) | (5.065) | (5.112) | |
| Bed | -177,431.500*** | -200,237.200*** | -311,230.100*** | -160,559.900*** | -155,013.500*** | -162,279.300*** |
| (26,364.890) | (27,312.320) | (29,612.850) | (25,680.120) | (25,599.310) | (26,045.800) | |
| Bath | 130,667.000*** | 121,706.900*** | 149,004.000*** | 106,146.400*** | 104,343.500*** | 105,904.800*** |
| (27,776.340) | (28,997.630) | (31,812.030) | (27,098.590) | (27,110.510) | (27,458.130) | |
| Age | -1,692.840** | -1,544.741** | -3,078.212*** | -1,675.232*** | -1,706.347*** | -1,724.194*** |
| (663.551) | (685.881) | (757.994) | (643.991) | (642.513) | (655.959) | |
| ActualSqFt | 601.234*** | 642.300*** | 907.582*** | 566.981*** | 566.367*** | 567.197*** |
| (21.615) | (22.281) | (20.603) | (21.352) | (21.341) | (21.645) | |
| parks_nn1 | 51.400** | 64.165*** | 57.585*** | 67.923*** | 112.126*** | |
| (21.364) | (24.291) | (20.369) | (19.549) | (19.696) | ||
| MedRent | 493.947*** | 521.641*** | 456.687*** | 441.525*** | ||
| (52.471) | (57.422) | (45.668) | (45.275) | |||
| MedHHInc | -7.827*** | -9.410*** | -7.539*** | -8.645*** | -4.584*** | |
| (1.451) | (1.528) | (1.251) | (1.294) | (1.164) | ||
| pctWhite | 12,467.680*** | 6,967.233** | ||||
| (4,065.926) | (2,886.996) | |||||
| pctBlack | 22,126.360*** | 13,438.520*** | ||||
| (6,677.178) | (3,927.429) | |||||
| pctHis | 8,266.271** | 5,494.729** | 605.473 | |||
| (3,931.581) | (2,424.217) | (751.950) | ||||
| pctBlackorHis | 1,473.638 | 4,242.732* | ||||
| (3,050.742) | (2,445.037) | |||||
| pctBachelors | -137,797.300*** | -139,740.100*** | -92,711.170*** | -108,829.400*** | -72,509.500*** | |
| (34,187.930) | (37,923.630) | (29,191.060) | (28,608.070) | (22,955.770) | ||
| elevatorYes | 1,459,061.000*** | 1,976,763.000*** | 1,659,684.000*** | 1,662,770.000*** | 1,581,587.000*** | |
| (177,745.200) | (184,267.700) | (172,869.400) | (173,020.000) | (174,485.600) | ||
| dockNo Dock | -938,449.800*** | -1,024,027.000*** | -977,745.300*** | -929,487.400*** | -926,879.800*** | |
| (90,502.210) | (94,386.810) | (89,406.460) | (88,570.450) | (89,229.720) | ||
| BeachViewYes | -87,089.150 | 196,961.700 | 244,590.800 | |||
| (151,595.400) | (172,970.800) | (161,821.600) | ||||
| poolPool | 1,660.539 | -279,862.100*** | -305,848.900*** | -121,022.900** | -132,705.700*** | -115,987.400** |
| (47,763.740) | (52,041.330) | (57,455.040) | (49,349.150) | (49,158.020) | (49,838.530) | |
| pctOwnerHH | 4,370.392* | 9,791.028*** | 5,523.733*** | 7,565.442*** | 6,574.484*** | |
| (2,420.360) | (2,839.093) | (1,850.178) | (1,407.104) | (1,345.469) | ||
| pctRenterHH | ||||||
| crime_nn5 | -31.101** | 3.317 | -14.777 | |||
| (14.112) | (16.844) | (12.749) | ||||
| milecoastYes | -45,641.080 | -61,817.680 | ||||
| (64,424.120) | (73,425.550) | |||||
| pctNoMortgage | -2,867.908 | 2,464.264 | 11,267.500 | |||
| (9,708.868) | (10,851.580) | (8,546.168) | ||||
| pctOwnerSince2010 | -5,777.683 | 4,609.115 | ||||
| (4,458.552) | (5,023.466) | |||||
| pctCarCommute | -23,557.440*** | -23,196.010*** | -14,759.380*** | -14,440.820*** | -14,239.530*** | |
| (3,998.039) | (4,590.258) | (2,845.830) | (2,717.118) | (2,390.630) | ||
| pctPubCommute | -11,823.920* | -7,224.647 | ||||
| (6,115.165) | (7,335.817) | |||||
| pctPoverty | 6,090.940** | 157.046 | ||||
| (2,810.809) | (3,425.889) | |||||
| Halfmile_metroNon_Halfmile_metro | -42,914.660 | -78,563.660 | 22,804.340 | 56,329.800 | ||
| (55,350.630) | (66,698.300) | (51,315.830) | (51,761.810) | |||
| metro_nn1 | 11.786*** | |||||
| (4.101) | ||||||
| sexualoffenders_Buffer | -20,107.980 | -24,402.190 | ||||
| (25,869.850) | (21,208.070) | |||||
| CoastDist | <<<<<<< HEAD 8,803.452 ======= 8,803.453 >>>>>>> 30a48337056764284380a80a0436d5403662727e | -28,645.250 | ||||
| (25,191.570) | (32,654.660) | |||||
| pctHH200kOrMore | 54,814.180*** | 59,847.630*** | 68,705.370*** | 73,923.860*** | 43,001.600*** | |
| (13,328.200) | (15,142.700) | (11,418.110) | (12,070.070) | (9,838.541) | ||
| luxuryYes | 3,943,744.000*** | 3,914,686.000*** | 3,884,152.000*** | 3,899,354.000*** | 3,956,162.000*** | |
| (176,617.800) | (181,844.500) | (170,471.500) | (170,702.800) | (172,911.900) | ||
| singlefamilyYes | -243,110.700*** | -548,427.800*** | -304,755.200*** | -293,070.000*** | -322,687.400*** | |
| (34,594.600) | (47,211.070) | (39,753.950) | (38,447.110) | (39,797.510) | ||
| estatesYes | 3,432,949.000*** | 3,266,344.000*** | 3,324,030.000*** | 3,465,026.000*** | ||
| (203,207.400) | (200,976.900) | (201,159.200) | (202,098.500) | |||
| Constant | 504,740.800*** | 1,732,718.000*** | -1,199,084.000** | -2,734.609 | 742,843.400*** | 1,353,779.000*** |
| (112,369.400) | (404,062.900) | (598,149.600) | (331,626.900) | (256,129.200) | (204,321.500) | |
| Observations | 2,627 | 2,626 | 2,626 | 2,626 | 2,626 | 2,627 |
| R2 | 0.845 | 0.837 | 0.803 | 0.857 | 0.857 | 0.852 |
| Adjusted R2 | 0.844 | 0.835 | 0.801 | 0.856 | 0.856 | 0.851 |
| Residual Std. Error | 808,061.300 (df = 2614) | 829,004.200 (df = 2599) | 910,113.100 (df = 2598) | 774,919.900 (df = 2603) | 775,610.700 (df = 2603) | 789,882.400 (df = 2606) |
| F Statistic | 1,187.840*** (df = 12; 2614) | 512.839*** (df = 26; 2599) | 393.389*** (df = 27; 2598) | 710.520*** (df = 22; 2603) | 709.044*** (df = 22; 2603) | 752.372*** (df = 20; 2606) |
| Note: | p<0.1; p<0.05; p<0.01 | |||||
Comparing the models we have created in the table above, we have decided that reg6, despite having a marginally smaller p-value than reg4, has the variables that we believe are critical for price prediction.
=======Comparing the models we have created in the table above, we have decided that reg6, despite having a marginally smaller p-value, has the variables that we believe are critical for price prediction.
>>>>>>> 30a48337056764284380a80a0436d5403662727e| Estimate | Std. Error | t value | |
|---|---|---|---|
| (Intercept) | 1353779 | 204321 | 6.626 |
| LotSize | 47.94 | 5.112 | 9.377 |
| Bed | -162279 | 26046 | -6.231 |
| Bath | 105905 | 27458 | 3.857 |
| Age | -1724 | 656 | -2.629 |
| ActualSqFt | 567.2 | 21.64 | 26.2 |
| parks_nn1 | 112.1 | 19.7 | 5.693 |
| MedHHInc | -4.584 | 1.164 | -3.94 |
| pctHis | 605.5 | 752 | 0.8052 |
| pctBachelors | -72510 | 22956 | -3.159 |
| poolPool | -115987 | 49839 | -2.327 |
| singlefamilyYes | -322687 | 39798 | -8.108 |
| pctOwnerHH | 6574 | 1345 | 4.886 |
| pctCarCommute | -14240 | 2391 | -5.956 |
| pctHH200kOrMore | 43002 | 9839 | 4.371 |
| Halfmile_metroNon_Halfmile_metro | 56330 | 51762 | 1.088 |
| luxuryYes | 3956162 | 172912 | 22.88 |
| elevatorYes | 1581587 | 174486 | 9.064 |
| dockNo Dock | -926880 | 89230 | -10.39 |
| sexualoffenders_Buffer | -24402 | 21208 | -1.151 |
| estatesYes | 3465026 | 202099 | 17.15 |
| Pr(>|t|) | ||
|---|---|---|
| (Intercept) | <<<<<<< HEAD4.183e-11 | |
| LotSize | 1.428e-20 | |
| Bed | 5.403e-10 | =======0.00000000004183 |
| LotSize | 0.00000000000000000001428 | |
| Bed | 0.0000000005403 | >>>>>>> 30a48337056764284380a80a0436d5403662727e|
| Bath | 0.0001176 | |
| Age | 0.008626 | |
| ActualSqFt | <<<<<<< HEAD1.516e-134 | |
| parks_nn1 | 1.39e-08 | |
| MedHHInc | 8.365e-05 | =======0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001516 |
| parks_nn1 | 0.0000000139 | |
| MedHHInc | 0.00008365 | >>>>>>> 30a48337056764284380a80a0436d5403662727e|
| pctHis | 0.4208 | |
| pctBachelors | 0.001603 | |
| poolPool | 0.02003 | |
| singlefamilyYes | <<<<<<< HEAD7.818e-16 | |
| pctOwnerHH | 1.089e-06 | |
| pctCarCommute | 2.926e-09 | |
| pctHH200kOrMore | 1.287e-05 | =======0.0000000000000007818 |
| pctOwnerHH | 0.000001089 | |
| pctCarCommute | 0.000000002926 | |
| pctHH200kOrMore | 0.00001287 | >>>>>>> 30a48337056764284380a80a0436d5403662727e|
| Halfmile_metroNon_Halfmile_metro | 0.2766 | |
| luxuryYes | <<<<<<< HEAD9.911e-106 | |
| elevatorYes | 2.402e-19 | |
| dockNo Dock | 8.547e-25 | =======0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000009911 |
| elevatorYes | 0.0000000000000000002402 | |
| dockNo Dock | 0.0000000000000000000000008547 | >>>>>>> 30a48337056764284380a80a0436d5403662727e|
| sexualoffenders_Buffer | 0.25 | |
| estatesYes | <<<<<<< HEAD1.613e-62 |
To test our selected model, we randomly divided our master dataset, MiamiProperties, into separate training (60%) and test (40%) sets. The table shows the results of model.
# set random seed
set.seed(31337)
# get index for training sample
inTrain <- caret::createDataPartition(
y = paste(MiamiProperties$Neighborhood),
p = .60, list = FALSE)
# split data into training and test
Miami.training <- MiamiProperties[inTrain,]
Miami.test <- MiamiProperties[-inTrain,]
#Regression
reg_final_train <- lm(SalePrice ~ ., data = st_drop_geometry(Miami.training) %>%
dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, parks_nn1, MedHHInc, pctHis, pctBachelors, pool,
singlefamily, pctOwnerHH, pctCarCommute, pctHH200kOrMore, Halfmile_metro, luxury, elevator, dock, sexualoffenders_Buffer, estates))
stargazer(reg_final_train, type = "html", main="Training Set Model Summary Results", caption="Model Summary Results for Training Dataset")| Dependent variable: | |
| SalePrice | |
| LotSize | 28.105*** |
| (6.645) | |
| Bed | -190,599.700*** |
| (32,140.920) | |
| Bath | 35,110.370 |
| (34,273.190) | |
| Age | -1,976.730** |
| (810.014) | |
| ActualSqFt | 657.844*** |
| (25.999) | |
| parks_nn1 | 131.932*** |
| (24.500) | |
| MedHHInc | -4.868*** |
| (1.460) | |
| pctHis | 329.402 |
| (932.747) | |
| pctBachelors | -39,660.310 |
| (27,994.360) | |
| poolPool | -98,341.250 |
| (63,949.590) | |
| singlefamilyYes | -335,280.300*** |
| (49,361.380) | |
| pctOwnerHH | 8,522.047*** |
| (1,664.231) | |
| pctCarCommute | -19,406.760*** |
| (2,940.341) | |
| pctHH200kOrMore | 37,930.350*** |
| (12,245.920) | |
| Halfmile_metroNon_Halfmile_metro | 88,067.790 |
| (63,993.000) | |
| luxuryYes | 3,616,042.000*** |
| (196,018.400) | |
| elevatorYes | 1,154,349.000*** |
| (204,790.700) | |
| dockNo Dock | -919,700.000*** |
| (115,585.000) | |
| sexualoffenders_Buffer | -26,241.090 |
| (26,070.220) | |
| estatesYes | 3,078,753.000*** |
| (239,871.600) | |
| Constant | 1,818,951.000*** |
| (251,193.600) | |
| Observations | 1,608 |
| R2 | 0.868 |
| Adjusted R2 | 0.866 |
| Residual Std. Error | 766,335.500 (df = 1587) |
| F Statistic | 521.846*** (df = 20; 1587) |
| Note: | p<0.1; p<0.05; p<0.01 |
| Training Set Model Summary Results |
| Model Summary Results for Training Dataset |
As an indicator of how well our model is performing, the values below show the mean average error of our model on the training and test set. We do think these error values are high. However, as discussed below, our model has issues predicting higher priced homes, which skews our mean average error. Therefore, we don’t think the mean average error is the best single indicator to validate the fit of our model.
## predicting on new data
reg_final_predict <- predict(reg_final_train, newdata = Miami.test)
## Mean Square Error train and test
rmse.train <- caret::MAE(predict(reg_final_train), Miami.training$SalePrice)
rmse.test <- caret::MAE(reg_final_predict, Miami.test$SalePrice)
cat("Train MAE: ", as.integer(rmse.train), " \n","Test MAE: ", as.integer(rmse.test))## Train MAE: 412089
## Test MAE: 372278
The first plot below shows predicted home prices as a function of observed home prices in Miami for all homes. The bottom figure shows predicted home prices as a function of observed home prices in Miami divided between the test set and the training set. The orange line represents a scenario where the predicted home price perfectly matches the observed home price. The green line represents our model’s prediction results. As the plots show, our model is systematically under predicting home prices (e.g., a home that is actually priced at $1 million is predicted to be approximately 600,000 dollars. The green line stays close to the perfect prediction line (orange) until homes with actual prices around 1 million dollars. At higher prices, our model performs worse.
#Plotting accuracy metrics
preds.train <- data.frame(pred = predict(reg_final_train),
actual = Miami.training$SalePrice,
source = "training data")
preds.test <- data.frame(pred = reg_final_predict,
actual = Miami.test$SalePrice,
source = "testing data")
preds <- rbind(preds.train, preds.test)
grid.arrange(nrow = 2,
ggplot(preds, aes(x = actual, y = pred)) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
geom_abline(color = "orange") +
coord_equal() +
theme_bw() +
labs(title = "Comparing predictions to actual values for all homes",
caption="Predicted Home Prices as a Function of Observed Home Prices",
x = "Actual Value",
y = "Predicted Value") +
theme(),
ggplot(preds, aes(x = actual, y = pred, color = source)) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
geom_abline(color = "orange") +
coord_equal() +
theme_bw() +
facet_wrap(~source, ncol = 2) +
labs(title = "Comparing predictions to actual values for the test set and the training set",
caption="Predicted Home Prices as a Function of Observed Home Prices for the Test Set and the Training Set",
x = "Actual Value",
y = "Predicted Value") +
theme(
legend.position = "none"))The table below summarizes the mean and standard deviation MAE for our model. In the calculation of RMSE, errors are squared before they are averaged, which gives a relatively higher weight to larger errors. Because our model predicts poorly for highly priced houses, we are not surprised to see RMSE values that are greater than MAE values. In this case, MAE is a better metric to measure the accuracy of our model.
| RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|
| 705970.4 | 0.8365392 | 395359.3 | 445205.2 | 0.1117048 | 144068.9 |
The graphs below shows the results from the 100 folds. THe statistics cluster together which indicates that the model would be generalizable to new data. The relatively normal histogram also confirms that our model would generalize to new data with some errors.
Overall, the mean absolute error for the test set is around 400,000 dollars and the mean absolute percentage error is over 100% (see below table). This indicates that our model does not perform well overall.
| Mean_Absolute_Error | Mean_Absolute_Percentage_Error |
|---|---|
| 402650 | 1.433111 |
The figure below shows a map of residuals for test set. Errors are both positive and negative, indicating that our model is both over and under predicting.
The figures below shows the plot of spatial lag in errors for the test set. Spatial lag tells the average sale price of the 20 nearest neighbors. The plot of spatial lag of error tells us that as home price error increases, the errors or nearby home prices also increase. This correlation tells us that there is a spatial relationship that is unaccounted for in the model. Similarly, the Moran’s I plot provides a secondary visual confirmation that spatial autocorrelation remains.
The table below shows the results of our attempt to account for spatial autocorrelation using a neighborhood feature. However, adding in the neighborhood effect did not greatly improve the results, likely due to a few very expensive homes that our model poorly predicts.
| Regression | SalePrice.AbsError | SalePrice.APE |
|---|---|---|
| Baseline Regression | 402650.0 | 1.433111 |
| Neighborhood Effects | 401716.9 | 1.018834 |
The map shows predicted sale prices for all homes. For the most part, the model predicts the most expensive home prices in Miami Beach and along the coast of the mainland.
secret_data <- MiamiPropertiesPred
secret_preds <- predict(reg6, newdata = secret_data)
output_preds <- data.frame(prediction = secret_preds, Folio = secret_data$Folio, team_name = "Miami Heat Predictors")
#write.csv(output_preds, "Miami Heat Predictors.csv")
known_preds <- predict(reg6, newdata = MiamiProperties)
known_preds_geom<-st_sf(prediction = known_preds, Folio = MiamiProperties$Folio, geometry= MiamiProperties$geometry)
secret_preds_geom <- st_sf(prediction = secret_preds, Folio = secret_data$Folio, geometry= secret_data$geometry)
allpreds<-rbind(known_preds_geom, secret_preds_geom)
ggplot()+
geom_sf(data = all_nhoods_MB, fill = "grey40") +
geom_sf(data = allpreds, aes(colour = q5(prediction)),
show.legend = "point", size = 1) +
scale_colour_manual(values = palette5,
labels=qBr(allpreds,"prediction"),
name="Quintile\nBreaks") +
labs(title="Predicted Home Prices",
caption="Predicted Home Prices for All Homes") +
mapTheme()The figure below shows the mean absolute percentage error by neighborhood. This map shoes that some neighborhoods have a higher mean absolute percentage error (depicted in orange) than the majority of the neighborhoods (in green).
#map MAPE by neighborhoods
st_drop_geometry(bothRegressions) %>%
group_by(Regression, Neighborhood) %>%
summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
ungroup() %>%
left_join(all_nhoods) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = mean.MAPE)) +
geom_sf(data = bothRegressions, colour = "black", size = .5) +
facet_wrap(~Regression) +
scale_fill_gradient(low = palette5[1], high = palette5[5],
name = "MAPE") +
labs(title = "Mean test set MAPE by neighborhood",
caption="Mean Absolute Percentage Error by Neighborhood for both the Baseline Regression and Neighborhood Effects") +
mapTheme()As seen in the figure below, there are two neighborhoods (Coral Gate and Flora Park) where the mean absolute percentage error is very high when the mean price is very low. In addition, there are two neighborhoods (Belle Island and San Marco Island) where the mean price is very high while the mean absolute percentage error is very low. For most neighborhoods, the mean price is less than 3,000,000 and the mean absolute percentage error is less than ~4. Overall, the mean absolute percentage errors are higher than is desired.
#Plot MAPE by neighborhoods
st_drop_geometry(Miami.test) %>%
group_by(Neighborhood) %>%
summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T),
mean.Price=mean(SalePrice, na.rm=T)) %>%
ggplot(aes(mean.Price, mean.MAPE)) +
geom_point(color="blue")+
geom_text(aes(label=ifelse(mean.MAPE>5, Neighborhood,"")), hjust=0.2, vjust=-0.3)+
geom_text(aes(label=ifelse(mean.Price>3000000, Neighborhood, "")),hjust=0.5, vjust=0)+
labs(title = "Mean Absolute Percentage Error as a Function of Mean Price by Neighborhood",
caption="Mean Absolute Percentage Error and Mean Price by Neighborhood") +
plotTheme()As shown in the tables below, this model is not generalizable by race or income. These tables provide the Mean Absolute Percent Error for different income groups and homes located in majority Hispanic census tracts. For homes in majority Hispanic census tracts, the model performs better for majority non-hispanic without neighborhood effects. With respect to income groups, our model performs better for homes located in high income census tracts than for homes located in low income tracts.
#Race Context and Income Context (from book)
tracts18 <-
tracts18%>%
mutate(raceContext = ifelse(pctHis > 50, "Majority Hispanic", "Majority Non-Hispanic"),
incomeContext = ifelse(MedInc > 32322, "High Income", "Low Income"))
grid.arrange(ncol = 2,
ggplot() + geom_sf(data = na.omit(tracts18), aes(fill = raceContext)) +
scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
labs(title = "Race Context") +
mapTheme() + theme(legend.position="bottom"),
ggplot() + geom_sf(data = na.omit(tracts18), aes(fill = incomeContext)) +
scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
labs(title = "Income Context") +
mapTheme() +
theme(legend.position="bottom"))st_join(st_centroid(bothRegressions), tracts18) %>%
group_by(Regression, raceContext) %>%
summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(raceContext, mean.MAPE) %>%
kable(caption = "Table X. Test set MAPE by neighborhood racial context") %>%
kable_styling("striped", full_width = F)| Regression | Majority Hispanic | Majority Non-Hispanic |
|---|---|---|
| Baseline Regression | 178% | 110% |
| Neighborhood Effects | 84% | 119% |
st_join(st_centroid(bothRegressions), tracts18) %>%
filter(!is.na(incomeContext)) %>%
group_by(Regression, incomeContext) %>%
summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(incomeContext, mean.MAPE) %>%
kable(caption = "Table XI. Test set MAPE by neighborhood income context") %>%
kable_styling("striped", full_width = F)| Regression | High Income | Low Income | |
|---|---|---|---|
| Baseline Regression | 44% | 180% | |
| Neighborhood Effects | 35% | 126% | =======0.00000000000000000000000000000000000000000000000000000000000001613 | >>>>>>> 30a48337056764284380a80a0436d5403662727e
Overall, the indicators used to evaluate the fit of the model (e.g., MAE) show that our model is not predicting well and is therefore not effective for use. However, additional evaluations show that the model performs fairly well for all homes less than 10 million dollars. We were unable to effectively engineer features to correctly model prices for very expensive homes. That being said, some of the more interesting variables included in the model were percent of the population with a bachelors degree, presence of an elevator or a dock, and whether the home is single family or multi-family.
The model was able to predict 85% of the variation in price, and most important features were internal property features, including lot size, actual square footage, whether the home is zoned for single family, and whether the home had an elevator or a dock. Generally, we think that our model’s inability to predict expensive homes is the largest source of the error. This poor performance skews our overall MAE. According to the mapped results, we were unable to effectively account for the spatial variation in prices. The model predicted poorly in Miami Beach and along the mainland coast, which is where many of the expensive houses are located.
We would recommend that Zillow make further improvements before employing the model for official use. The model has large errors and does not do a good job of predicting home prices. In order to improve this model, we suggest further exploring features that would better account for the spatial processes at play. The goal of adding these features would be to account for spatial autocorrelation.
=======-Split the ‘toPredict’ == 0 into a separate training and test set. -Provide a polished table of your (training set) lm summary results (coefficients, R2 etc). -Provide a polished table of mean absolute error and MAPE for a single test set. Check out the “kable” function for markdown to create nice tables. -Provide the results of your cross-validation tests. This includes mean and standard deviation MAE. Do 100 folds and plot your cross-validation MAE as a histogram. Is your model generalizable to new data? -Plot predicted prices as a function of observed prices -Provide a map of your residuals for your test set. Include a Moran’s I test and a plot of the spatial lag in errors. 5 -Provide a map of your predicted values for where ‘toPredict’ is both 0 and 1. -Using the test set predictions, provide a map of mean absolute percentage error (MAPE) by neighborhood. -Provide a scatterplot plot of MAPE by neighborhood as a function of mean price by neighborhood. -Using tidycensus, split your city in to two groups (perhaps by race or income) and test your model’s generalizability. Is your model generalizable?
To test our selected model, we randomly divided our master dataset, MiamiProperties, into separate training (60%) and test (40%) sets. Table 2
# set random seed
set.seed(31337)
# get index for training sample
inTrain <- caret::createDataPartition(
y = paste(MiamiProperties$Neighborhood),
p = .60, list = FALSE)
# split data into training and test
Miami.training <- MiamiProperties[inTrain,]
Miami.test <- MiamiProperties[-inTrain,]
anyNA(MiamiProperties$estates)[1] FALSE
#Regression
reg_final_train <- lm(SalePrice ~ ., data = st_drop_geometry(Miami.training) %>%
dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, parks_nn1, MedHHInc, pctHis, pctBachelors, pool,
singlefamily, pctOwnerHH, pctCarCommute, pctHH200kOrMore, Halfmile_metro, luxury, elevator, dock, sexualoffenders_Buffer, estates))
stargazer(reg_final_train, type = "html", main="Training Set Model Summary Results", caption="Table 2. Model Summary Results for Training Dataset")| Dependent variable: | |
| SalePrice | |
| LotSize | 28.105*** |
| (6.645) | |
| Bed | -190,599.700*** |
| (32,140.920) | |
| Bath | 35,110.370 |
| (34,273.190) | |
| Age | -1,976.730** |
| (810.014) | |
| ActualSqFt | 657.844*** |
| (25.999) | |
| parks_nn1 | 131.932*** |
| (24.500) | |
| MedHHInc | -4.868*** |
| (1.460) | |
| pctHis | 329.402 |
| (932.747) | |
| pctBachelors | -39,660.310 |
| (27,994.360) | |
| poolPool | -98,341.250 |
| (63,949.590) | |
| singlefamilyYes | -335,280.300*** |
| (49,361.380) | |
| pctOwnerHH | 8,522.047*** |
| (1,664.231) | |
| pctCarCommute | -19,406.760*** |
| (2,940.341) | |
| pctHH200kOrMore | 37,930.350*** |
| (12,245.920) | |
| Halfmile_metroNon_Halfmile_metro | 88,067.790 |
| (63,993.000) | |
| luxuryYes | 3,616,042.000*** |
| (196,018.400) | |
| elevatorYes | 1,154,349.000*** |
| (204,790.700) | |
| dockNo Dock | -919,700.000*** |
| (115,585.000) | |
| sexualoffenders_Buffer | -26,241.090 |
| (26,070.220) | |
| estatesYes | 3,078,753.000*** |
| (239,871.600) | |
| Constant | 1,818,951.000*** |
| (251,193.600) | |
| Observations | 1,608 |
| R2 | 0.868 |
| Adjusted R2 | 0.866 |
| Residual Std. Error | 766,335.500 (df = 1587) |
| F Statistic | 521.846*** (df = 20; 1587) |
| Note: | p<0.1; p<0.05; p<0.01 |
| Training Set Model Summary Results |
| Table 2. Model Summary Results for Training Dataset |
## predicting on new data
reg_final_predict <- predict(reg_final_train, newdata = Miami.test)
## Mean Square Error train and test
rmse.train <- caret::MAE(predict(reg_final_train), Miami.training$SalePrice)
rmse.test <- caret::MAE(reg_final_predict, Miami.test$SalePrice)
cat("Train MAE: ", as.integer(rmse.train), " \n","Test MAE: ", as.integer(rmse.test))## Train MAE: 412089
## Test MAE: 372278
#Plotting accuracy metrics
preds.train <- data.frame(pred = predict(reg_final_train),
actual = Miami.training$SalePrice,
source = "training data")
preds.test <- data.frame(pred = reg_final_predict,
actual = Miami.test$SalePrice,
source = "testing data")
preds <- rbind(preds.train, preds.test)
ggplot(preds, aes(x = pred, y = actual, color = source)) +
geom_point() +
geom_smooth(method = "lm", color = "green") +
geom_abline(color = "orange") +
coord_equal() +
theme_bw() +
facet_wrap(~source, ncol = 2) +
labs(title = "Comparing predictions to actual values",
x = "Predicted Value",
y = "Actual Value") +
theme(
legend.position = "none"
)##### PREDICTION #####
secret_data <- MiamiPropertiesPred
secret_preds <- predict(reg6, newdata = secret_data)
output_preds <- data.frame(prediction = secret_preds, Folio = secret_data$Folio, team_name = "Miami Heat Predictors")
write.csv(output_preds, "Miami Heat Predictors.csv")########### CROSS- VALIDATION ##########
#Generalizability - cross validation
fitControl <- trainControl(method = "cv", number = 100, savePredictions = TRUE)
set.seed(825)
reg.cv <-
train(SalePrice ~ ., data = st_drop_geometry(MiamiProperties) %>%
dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, parks_nn1, MedHHInc, pctHis, pctBachelors, pool,
singlefamily, pctOwnerHH, pctCarCommute, pctHH200kOrMore, Halfmile_metro, luxury, elevator, dock, sexualoffenders_Buffer, estates),
method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv## Linear Regression
##
## 2627 samples
## 20 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 2601, 2600, 2602, 2600, 2601, 2601, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 705970.4 0.8365392 395359.3
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## RMSE Rsquared MAE Resample
## 1 773505.1 0.9765727 430198.9 Fold001
## 2 1336408.5 0.9180646 606279.2 Fold002
## 3 671136.5 0.9310164 393067.8 Fold003
## 4 672522.1 0.7739194 461455.7 Fold004
## 5 689466.6 0.7679178 400598.4 Fold005
## 6 948758.9 0.9745954 402110.2 Fold006
## 7 359388.1 0.6679438 285979.5 Fold007
## 8 312685.9 0.8556234 259063.1 Fold008
## 9 928025.0 0.7386786 474204.7 Fold009
## 10 1898782.1 0.9621317 885762.2 Fold010
## 11 546840.6 0.9500958 377649.4 Fold011
## 12 1079773.8 0.8012237 470954.1 Fold012
## 13 450627.0 0.7802116 291624.2 Fold013
## 14 396249.0 0.8227055 293931.6 Fold014
## 15 493257.0 0.7859733 336333.4 Fold015
## 16 350777.0 0.7351069 217379.3 Fold016
## 17 1126474.4 0.7738343 596149.0 Fold017
## 18 1035253.3 0.7841888 518705.5 Fold018
## 19 374594.2 0.7749765 300680.4 Fold019
## 20 435979.8 0.7241656 337576.9 Fold020
## 21 781308.5 0.8780393 326649.6 Fold021
## 22 623979.9 0.6819755 413006.7 Fold022
## 23 1504376.2 0.4682456 699978.0 Fold023
## 24 458731.0 0.8120372 383649.0 Fold024
## 25 470262.3 0.9876497 325502.6 Fold025
## 26 868910.2 0.8633297 403368.2 Fold026
## 27 1232477.1 0.8114825 485249.3 Fold027
## 28 304735.0 0.8947379 253863.9 Fold028
## 29 393765.0 0.9114925 294384.3 Fold029
## 30 372335.7 0.8670785 274610.8 Fold030
## 31 390110.3 0.8353459 310797.0 Fold031
## 32 1013279.3 0.8741693 514069.9 Fold032
## 33 463564.3 0.8881520 368094.9 Fold033
## 34 567097.3 0.8572772 341291.2 Fold034
## 35 355287.4 0.5290788 266347.2 Fold035
## 36 667973.2 0.7645775 372203.0 Fold036
## 37 2794545.1 0.9073163 838659.8 Fold037
## 38 1168442.7 0.9336798 509841.3 Fold038
## 39 808275.0 0.7841398 436260.4 Fold039
## 40 521672.8 0.7042793 320989.6 Fold040
## 41 595353.3 0.9455842 388787.8 Fold041
## 42 1291256.7 0.9269569 608567.2 Fold042
## 43 311471.9 0.9158634 231478.8 Fold043
## 44 310502.9 0.7882913 239804.7 Fold044
## 45 533492.1 0.8354165 374576.4 Fold045
## 46 558253.3 0.9696126 405957.4 Fold046
## 47 407164.8 0.7050802 317756.0 Fold047
## 48 410460.3 0.9642930 332013.8 Fold048
## 49 887922.1 0.8231950 472054.4 Fold049
## 50 672804.3 0.6192428 353980.6 Fold050
## 51 522826.8 0.7537659 382985.8 Fold051
## 52 473851.1 0.6159535 339209.6 Fold052
## 53 642771.4 0.9557383 453192.7 Fold053
## 54 564022.9 0.8035761 407895.6 Fold054
## 55 1069713.7 0.8456283 604164.9 Fold055
## 56 367785.7 0.9909964 261109.3 Fold056
## 57 479899.9 0.9405142 304032.3 Fold057
## 58 255378.8 0.8288930 196113.9 Fold058
## 59 1044710.7 0.9295741 574472.1 Fold059
## 60 317281.1 0.7540499 247147.4 Fold060
## 61 367356.2 0.8410152 302212.4 Fold061
## 62 709895.2 0.9373799 422890.5 Fold062
## 63 1123032.0 0.8613488 424309.3 Fold063
## 64 734524.8 0.7659228 448788.5 Fold064
## 65 299213.5 0.8635155 237928.8 Fold065
## 66 334058.9 0.7258302 237451.4 Fold066
## 67 989911.1 0.9924853 421428.9 Fold067
## 68 542856.4 0.9900273 336286.1 Fold068
## 69 984634.9 0.9545788 517953.0 Fold069
## 70 809832.1 0.9761899 355268.4 Fold070
## 71 287613.0 0.8020381 230585.3 Fold071
## 72 603176.2 0.9160568 345114.9 Fold072
## 73 404871.7 0.8523867 314251.1 Fold073
## 74 841628.6 0.5399965 401023.9 Fold074
## 75 837971.4 0.9451447 527560.9 Fold075
## 76 838734.0 0.7042508 490796.2 Fold076
## 77 664316.9 0.8791849 335175.8 Fold077
## 78 347920.2 0.9263979 255828.4 Fold078
## 79 388867.1 0.7643258 265600.3 Fold079
## 80 567665.1 0.9610869 418659.4 Fold080
## 81 683677.2 0.9800604 398450.1 Fold081
## 82 736524.5 0.6317219 459411.3 Fold082
## 83 346396.2 0.8541700 269203.6 Fold083
## 84 605961.9 0.8515606 382929.4 Fold084
## 85 1886622.7 0.8638038 925501.6 Fold085
## 86 1051618.3 0.9418650 582945.1 Fold086
## 87 311014.7 0.9724418 263592.2 Fold087
## 88 730867.4 0.7781796 410413.5 Fold088
## 89 2763407.2 0.7963621 890907.4 Fold089
## 90 650293.8 0.8911023 326229.0 Fold090
## 91 1235530.0 0.9189961 537204.1 Fold091
## 92 468250.3 0.9033640 299455.3 Fold092
## 93 751792.9 0.7827611 399809.8 Fold093
## 94 496544.5 0.7432410 297671.9 Fold094
## 95 438598.2 0.8611757 300805.3 Fold095
## 96 374291.7 0.8758935 239814.5 Fold096
## 97 483600.2 0.8948108 342130.8 Fold097
## 98 673718.8 0.8858796 369229.2 Fold098
## 99 490757.6 0.5483328 350915.7 Fold099
## 100 380939.3 0.8137839 226376.2 Fold100
reg.cv$resample %>%
pivot_longer(-Resample) %>%
mutate(name = as.factor(name)) %>%
ggplot(., aes(x = name, y = value, color = name)) +
geom_jitter(width = 0.1) +
facet_wrap(~name, ncol = 3, scales = "free") +
theme_bw() +
theme(
legend.position = "none"
)## [1] 2627
## [1] 2627
#Create dataset with "out of fold" predictions and original data
map_preds <- MiamiProperties %>%
rowid_to_column(var = "rowIndex") %>%
left_join(cv_preds, by = "rowIndex") %>%
mutate(SalePrice.AbsError = abs(pred - SalePrice)) %>%
cbind(st_coordinates(st_centroid((.))))
st_crs(map_preds) <- st_crs(all_nhoods)
# plot errors on a map
ggplot() +
geom_sf(data = all_nhoods_MB, fill = "grey40") +
geom_sf(data = map_preds, aes(colour = q5(SalePrice.AbsError)),
show.legend = "point", size = 1) +
scale_colour_manual(values = palette5,
labels=qBr(map_preds,"SalePrice.AbsError"),
name="Quintile\nBreaks") +
labs(title="Absolute sale price errors on the OOF set",
subtitle = "OOF = 'Out Of Fold'") +
mapTheme()#Start Ch4
k_nearest_neighbors = 20
#prices
coords <- st_coordinates(st_centroid(MiamiProperties))
# k nearest neighbors
neighborList <- knn2nb(knearneigh(coords, k_nearest_neighbors))
spatialWeights <- nb2listw(neighborList, style="W")
MiamiProperties$lagPrice <- lag.listw(spatialWeights, MiamiProperties$SalePrice)
#errors
Miami.test <-
Miami.test %>%
mutate(Regression = "Baseline Regression",
SalePrice.Predict = ifelse((predict(reg_final_train, Miami.test)) < 0, mean(Miami.training$SalePrice), predict(reg_final_train, Miami.test)),
SalePrice.Error = SalePrice.Predict - SalePrice,
SalePrice.AbsError = abs(SalePrice.Predict - SalePrice),
SalePrice.APE = (abs(SalePrice.Predict - SalePrice)) / abs(SalePrice.Predict))%>%
mutate(SalePrice.AbsError = replace_na(SalePrice.AbsError, 0))%>%
mutate(SalePrice.Error=replace_na(SalePrice.Error, 0))
coords.test <- st_coordinates(st_centroid(Miami.test))
neighborList.test <- knn2nb(knearneigh(coords.test, k_nearest_neighbors))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
Miami.test$lagPriceError <- lag.listw(spatialWeights.test, Miami.test$SalePrice.AbsError)
summary(Miami.test$SalePrice.AbsError)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1333 100462 240320 402650 487901 13444949
##lag of price
Miami.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.AbsError))## Simple feature collection with 1019 features and 88 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 880977.1 ymin: 500856.3 xmax: 944453.6 ymax: 559657.2
## projected CRS: NAD_1983_StatePlane_Florida_East_FIPS_0901_Feet
## First 10 features:
## Folio SalePrice saleYear Property.Address Property.City
## 1 131320040710 355000 2018 3811 NW 12 TER Miami
## 2 141290210170 393800 2020 4001 WOODRIDGE RD Miami
## 3 131320041310 304500 2018 951 NW 39 CT Miami
## 4 232270170570 2150000 2018 3738 PINE TREE DR Miami Beach
## 5 131340140090 340000 2018 2051 NW 18 ST Miami
## 6 131330050230 330000 2019 1554 NW 29 CT Miami
## 7 131260390541 260000 2019 1253 NW 33 ST Miami
## 8 141150240220 1700000 2018 1900 SECOFFEE ST Miami
## 9 232220080020 940000 2019 4585 NAUTILUS DR Miami Beach
## 10 232100121170 432500 2018 7115 TROUVILLE ESPLANADE Miami Beach
## Property.Zip County.Taxable City.Taxable AdjustedSqFt LotSize Bed Bath
## 1 33126-2618 290388 290388 1717 6780 2 1
## 2 33133-6617 932499 932499 1506 9380 2 2
## 3 33126-3621 192321 192321 983 8760 2 1
## 4 33140-3936 1748919 1748919 3798 17700 3 3
## 5 33125-1411 254935 254935 2056 8400 4 2
## 6 33125-2036 240596 240596 1654 9420 2 1
## 7 33142-6150 200414 200414 1336 6027 2 1
## 8 33133-3211 1420637 1420637 2572 20000 4 3
## 9 33140-2827 798332 798332 1827 10240 3 3
## 10 33141-3515 312603 312603 1662 6500 2 2
## Stories Units EffectiveYearBuilt LivingSqFt ActualSqFt X toPredict
## 1 1 1 1997 1548 2429 <NA> 0
## 2 1 1 2018 1583 1583 <NA> 0
## 3 1 1 1980 871 1182 <NA> 0
## 4 2 1 1995 3647 4286 <NA> 0
## 5 1 1 1975 1803 2535 <NA> 0
## 6 1 1 1949 1152 2664 <NA> 0
## 7 1 1 1930 1336 1336 <NA> 0
## 8 1 1 1950 2572 2572 <NA> 0
## 9 1 1 1960 1827 1827 <NA> 0
## 10 1 1 1947 1655 1675 <NA> 0
## pool singlefamily estates dock Age 0 luxury elevator Gazebo BeachView
## 1 No Pool Yes No No Dock 23 0 No No No No
## 2 No Pool Yes No No Dock 2 0 No No No No
## 3 No Pool Yes No No Dock 40 0 No No No No
## 4 Pool No Yes No Dock 25 0 No No No No
## 5 No Pool No No No Dock 45 0 No No No No
## 6 No Pool No No No Dock 71 0 No No No No
## 7 No Pool No No No Dock 90 0 No No No No
## 8 No Pool Yes No No Dock 70 0 No No No No
## 9 Pool Yes No No Dock 60 0 No No No No
## 10 No Pool Yes No No Dock 73 0 No No No No
## CoastDist milecoast halfmilecoast bars_nn1 bars_nn2 bars_nn3 bars_nn4
## 1 2.962703488 No No 1579.023 1728.056 1783.866 1818.521
## 2 0.338089977 Yes Yes 7429.355 7431.387 7510.776 7562.259
## 3 2.968909961 No No 1114.699 1170.437 1191.080 1212.170
## 4 0.008728162 Yes Yes 1060.072 1208.241 1391.619 1690.378
## 5 1.508557718 No No 5869.254 6483.301 6725.936 6886.482
## 6 2.190778631 No No 3440.014 4040.681 4702.859 5148.298
## 7 1.290936092 No No 4905.154 5330.119 5491.495 5605.220
## 8 0.162915663 Yes Yes 2016.365 2701.140 3035.818 3205.692
## 9 0.122984722 Yes Yes 3240.267 3798.052 4319.211 4600.420
## 10 0.254416641 Yes Yes 1928.127 1928.497 1943.482 1952.119
## bars_nn5 bars_Buffer sexualoffenders_Buffer crime_nn1 crime_nn2 crime_nn3
## 1 1842.905 0 0 1370.7684 1568.139 1761.061
## 2 7606.466 0 0 5278.8628 5596.144 5702.356
## 3 1226.014 0 0 1652.0567 1946.055 2067.804
## 4 1870.726 0 0 7094.2441 7377.017 7485.720
## 5 6983.101 0 1 316.1319 863.202 1219.966
## 6 5548.213 0 0 1406.6022 1406.602 1456.863
## 7 5682.011 0 0 845.1109 1337.243 1508.471
## 8 3411.116 0 0 2121.7092 2973.869 3265.542
## 9 5015.358 0 0 6225.9584 6989.676 8095.403
## 10 2134.229 0 0 1653.4457 3353.656 3930.982
## crime_nn4 crime_nn5 parks_nn1 parks_nn2 parks_nn3 parks_nn4 parks_nn5
## 1 1896.454 2067.695 1876.0041 2029.165 2958.666 3837.817 4431.129
## 2 7042.636 7915.235 840.0082 1283.240 1441.323 1600.715 2264.571
## 3 2249.676 2516.068 2956.0521 3036.696 3314.580 4009.695 4617.866
## 4 7612.477 7779.277 1869.1895 2109.965 2217.044 2328.252 2406.506
## 5 1481.557 1643.815 849.8401 1329.159 1751.722 2070.070 2262.667
## 6 1543.299 1657.059 1625.3104 2359.249 2805.104 3127.282 3424.477
## 7 1605.037 1702.643 2704.4825 2949.092 3122.917 3379.478 3567.166
## 8 3631.172 4024.269 1564.4762 1910.319 2258.438 2922.009 3464.650
## 9 8651.327 8994.762 1962.5849 1966.289 1971.318 2368.780 2764.235
## 10 4299.127 4680.126 325.0006 1076.707 1646.523 1993.162 2216.839
## parks.Buffer metro_nn1 Halfmile_metro elem_name
## 1 0 4735.6939 Non_Halfmile_metro Kinloch Park Elementary
## 2 0 6566.7689 Non_Halfmile_metro Coconut Grove Elementary
## 3 0 5708.2272 Non_Halfmile_metro Kinloch Park Elementary
## 4 0 21477.2030 Non_Halfmile_metro North Beach Elementary
## 5 0 4695.1523 Non_Halfmile_metro Comstock Elementary
## 6 0 6247.3365 Non_Halfmile_metro Kensington Park Elementary
## 7 0 575.2278 Halfmile_metro Santa Clara Elementary
## 8 0 4291.2755 Non_Halfmile_metro Coconut Grove Elementary
## 9 0 20948.7032 Non_Halfmile_metro North Beach Elementary
## 10 1 28446.8243 Non_Halfmile_metro Treasure Island Elementary
## Neighborhood MiamiBeach GEOID TotalPop MedHHInc MedRent
## 1 West Grapeland Heights No 12086490100 1881 38250 845
## 2 South Grove No 12086007300 4037 191250 1900
## 3 West Grapeland Heights No 12086490100 1881 38250 845
## 4 MIAMI BEACH Yes 12086004000 5629 92181 1450
## 5 Curtis Park No 12086003003 6339 25521 888
## 6 North Grapeland Heights No 12086005001 7763 39704 1156
## 7 Santa Clara No 12086002501 5880 27540 901
## 8 East Grove No 12086006802 3373 127900 0
## 9 MIAMI BEACH Yes 12086004000 5629 92181 1450
## 10 MIAMI BEACH Yes 12086003915 5799 49662 1258
## MedInc TotalHH OwnerHH RenterHH HHNoMort Own2017later Own201516 Own201014
## 1 20761 1866 1239 627 157 0 5 45
## 2 91949 3962 3416 546 465 33 138 304
## 3 20761 1866 1239 627 157 0 5 45
## 4 46659 5422 3637 1785 640 18 78 223
## 5 14949 6255 863 5392 98 0 0 30
## 6 16315 7630 4493 3137 559 36 91 218
## 7 17383 5759 1637 4122 192 32 12 78
## 8 74966 3308 2573 735 539 55 141 285
## 9 46659 5422 3637 1785 640 18 78 223
## 10 24668 5756 2333 3423 342 16 130 102
## HH200k pctWhite pctBlack pctHis pctBlackorHis pctBachelors pctPoverty
## 1 8 91.86603 3.6682616 94.09888 97.76715 0.4253057 19.191919
## 2 733 88.33292 1.4862522 41.93708 43.42333 0.3467922 11.072579
## 3 8 91.86603 3.6682616 94.09888 97.76715 0.4253057 19.191919
## 4 438 91.33061 0.8527269 28.69071 29.54344 1.5988630 8.296323
## 5 16 86.65405 9.6229689 91.08692 100.70989 0.3312825 39.391071
## 6 0 93.48190 1.4427412 97.42368 98.86642 1.3525699 16.617287
## 7 23 61.30952 23.1802721 83.35034 106.53061 0.0000000 42.329932
## 8 507 96.70916 0.3854136 45.33057 45.71598 1.0672991 9.546398
## 9 438 91.33061 0.8527269 28.69071 29.54344 1.5988630 8.296323
## 10 86 74.58182 1.7589240 71.75375 73.51267 2.3624763 19.951716
## pctCarCommute pctPubCommute pctOwnerHH pctRenterHH pctNoMortgage
## 1 88.09284 3.6326942 66.39871 33.60129 8.413719
## 2 79.76756 0.1584786 86.21908 13.78092 11.736497
## 3 88.09284 3.6326942 66.39871 33.60129 8.413719
## 4 61.01340 2.2194305 67.07857 32.92143 11.803762
## 5 80.27330 12.8241065 13.79696 86.20304 1.566747
## 6 92.67944 3.9748356 58.88598 41.11402 7.326343
## 7 78.38672 13.1995369 28.42507 71.57493 3.333912
## 8 79.54393 0.4024145 77.78114 22.21886 16.293833
## 9 61.01340 2.2194305 67.07857 32.92143 11.803762
## 10 76.03942 5.0816138 40.53162 59.46838 5.941626
## pctOwnerSince2010 pctHH200kOrMore year geometry
## 1 4.035513 0.4287245 2018 POINT (899958.5 528010.9)
## 2 13.905152 18.5007572 2018 POINT (900622.5 501244.8)
## 3 4.035513 0.4287245 2018 POINT (899458.4 527050.3)
## 4 8.770965 8.0781999 2018 POINT (943195.3 538198.7)
## 5 3.476246 0.2557954 2018 POINT (909818 530781)
## 6 7.678611 0.0000000 2018 POINT (905278 529658.8)
## 7 7.452657 0.3993749 2018 POINT (913864.7 536521.7)
## 8 18.694131 15.3264813 2018 POINT (910913.1 512757.6)
## 9 8.770965 8.0781999 2018 POINT (940728.9 541686.6)
## 10 10.630090 1.4940931 2018 POINT (939845.8 553167.4)
## Regression SalePrice.Predict SalePrice.Error SalePrice.AbsError
## 1 Baseline Regression 996882.7 641882.72 641882.72
## 2 Baseline Regression 710422.4 316622.38 316622.38
## 3 Baseline Regression 341087.3 36587.28 36587.28
## 4 Baseline Regression 6206694.5 4056694.49 4056694.49
## 5 Baseline Regression 657842.9 317842.93 317842.93
## 6 Baseline Regression 1220939.5 890939.53 890939.53
## 7 Baseline Regression 409873.8 149873.78 149873.78
## 8 Baseline Regression 1367543.3 -332456.70 332456.70
## 9 Baseline Regression 908499.1 -31500.89 31500.89
## 10 Baseline Regression 138788.3 -293711.74 293711.74
## SalePrice.APE lagPriceError
## 1 0.64388991 218062.4
## 2 0.44568188 403176.0
## 3 0.10726662 255348.4
## 4 0.65359983 566141.7
## 5 0.48315930 236394.2
## 6 0.72971634 248975.2
## 7 0.36565837 270265.6
## 8 0.24310506 465061.7
## 9 0.03467355 580959.9
## 10 2.11625785 523939.3
ggplot(MiamiProperties, aes(x=lagPrice, y=SalePrice)) +
geom_point(colour = "#FA7800") +
geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
labs(title = "Price as a function of the spatial lag of price",
caption = "Public Policy Analytics, Figure 6.6",
x = "Spatial lag of price (Mean price of 5 nearest neighbors)",
y = "Sale Price") +
plotTheme()ggplot(Miami.test, aes(x=lagPriceError, y=SalePrice)) +
geom_point(colour = "#FA7800") +
geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
labs(title = "Error as a function of the spatial lag of price",
caption = "",
x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
y = "Sale Price") +
plotTheme()#Moran's I - A measure of spatial correlation
moranTest <- moran.mc(Miami.test$SalePrice.AbsError,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count",
caption="Miami") +
plotTheme()# this plot shows that there is spatial correlation
## Provide a map of your predicted values for where ‘toPredict’ is both 0 and 1.
#Errors by group - Neighborhood, Elementary School
##Neighborhood
nhood_sum <- Miami.test %>%
group_by(Neighborhood) %>%
summarize(meanPrice = mean(SalePrice, na.rm = T),
meanPrediction = mean(SalePrice.Predict, na.rm = T),
meanMAE = mean(SalePrice.AbsError, na.rm = T))
nhood_sum %>%
st_drop_geometry %>%
arrange(desc(meanMAE)) %>%
kable() %>% kable_styling()| Neighborhood | meanPrice | meanPrediction | meanMAE |
|---|---|---|---|
| San Marco Island | 11695000.0 | 6963342.3 | 4731657.75 |
| Bay Heights | 1398000.0 | 2361979.1 | 963979.06 |
| MIAMI BEACH | 2886936.7 | 2679476.7 | 915020.01 |
| South Grove Bayside | 2390550.0 | 2883448.7 | 685183.49 |
| Highland Park | 850000.0 | 188341.7 | 661658.31 |
| East Grove | 1918371.4 | 2203690.6 | 637156.80 |
| Belle Island | 5182500.0 | 5650601.9 | 631155.06 |
| Grove Center | 2475000.0 | 2468669.8 | 513263.87 |
| Little River Central | 192600.0 | 701855.9 | 509255.93 |
| Fair Isle | 1974166.7 | 1879352.1 | 507144.71 |
| West Grove | 481400.0 | 885745.0 | 482841.31 |
| Bird Grove East | 655000.0 | 425994.3 | 480232.37 |
| Northwestern Estates | 147083.3 | 512498.1 | 431776.49 |
| Edgewater | 545000.0 | 152375.0 | 392624.96 |
| Baypoint | 2160000.0 | 2153078.5 | 386170.26 |
| South Grove | 1195517.6 | 1368252.9 | 379028.41 |
| Bird Grove West | 475000.0 | 651115.5 | 376337.81 |
| Citrus Grove | 308194.7 | 588761.9 | 375295.85 |
| Morningside | 1016852.6 | 1051312.3 | 367481.78 |
| King Heights | 223418.2 | 486216.1 | 364455.32 |
| Flora Park | 187231.6 | 453878.6 | 356821.75 |
| Melrose | 408671.4 | 595390.0 | 335058.81 |
| Hadley Park | 204852.9 | 410665.9 | 328153.04 |
| North Grapeland Heights | 290588.2 | 555891.3 | 328063.67 |
| Brentwood | 332500.0 | 659603.8 | 327103.77 |
| Shenandoah South | 458308.1 | 602663.8 | 318187.64 |
| Belle Meade | 959200.0 | 1035626.4 | 312414.03 |
| Belle Meade West | 391250.0 | 699228.0 | 307977.99 |
| Shorecrest | 547500.0 | 650718.4 | 293542.36 |
| Curtis Park | 284000.0 | 558372.5 | 293398.22 |
| Flagami | 339740.9 | 500982.3 | 289283.12 |
| Orchard Villa | 218727.3 | 415495.6 | 284769.38 |
| Auburndale | 316500.0 | 545627.9 | 274538.68 |
| North Grove | 1136470.6 | 1042797.3 | 269367.36 |
| East Little Havana | 424566.7 | 653691.7 | 266036.30 |
| Parkdale South | 382500.0 | 452603.7 | 264816.50 |
| Liberty Square | 146168.8 | 361483.7 | 260392.62 |
| Edison | 221136.0 | 419123.2 | 254546.34 |
| North Sewell Park | 411250.0 | 348701.8 | 249281.22 |
| Douglas Park | 350333.3 | 474881.6 | 248661.23 |
| Roads | 655037.1 | 672794.5 | 248431.51 |
| Shenandoah North | 476894.7 | 437459.1 | 246745.05 |
| Santa Clara | 247547.1 | 460572.9 | 238849.32 |
| South Grapeland Heights | 296812.5 | 479782.7 | 236458.26 |
| La Pastorita | 372500.0 | 140718.7 | 231781.31 |
| Spring Garden | 1460000.0 | 1237770.8 | 222229.15 |
| Bayside | 595307.7 | 756383.3 | 217689.54 |
| Coral Gate | 472100.0 | 342632.9 | 217514.25 |
| Buena Vista West | 320305.3 | 459212.7 | 207779.01 |
| Silver Bluff | 501716.2 | 522366.8 | 207169.64 |
| Old San Juan | 495175.0 | 514327.2 | 194075.24 |
| West Grapeland Heights | 344400.0 | 473866.9 | 182249.76 |
| Lemon City/Little Haiti | 271187.5 | 405862.0 | 172148.37 |
| South Sewell Park | 384142.9 | 286857.3 | 170530.07 |
| Miami Avenue | 1461450.0 | 1443793.8 | 158016.28 |
| Parkdale North | 420983.3 | 282265.7 | 138717.64 |
| Latin Quarter | 240000.0 | 104701.5 | 135298.49 |
| Northeast Overtown | 370000.0 | 501690.1 | 131690.13 |
| Orange Bowl | 250000.0 | 136304.5 | 113695.48 |
| Allapattah Industrial District | 220000.0 | 119212.2 | 100787.83 |
| Historic Buena Vista East | 600000.0 | 685393.3 | 85393.29 |
| Buena Vista Heights | 390171.4 | 420124.9 | 74480.80 |
| Civic Center | 245000.0 | 312966.4 | 67966.43 |
| Palm Grove | 740000.0 | 701222.1 | 38777.86 |
map_preds_sum <- map_preds %>%
group_by(Neighborhood) %>%
summarise(meanMAE = mean(SalePrice.AbsError))
ggplot() +
geom_sf(data = all_nhoods %>%
left_join(st_drop_geometry(map_preds_sum), by = "Neighborhood"),
aes(fill = q5(meanMAE))) +
scale_fill_manual(values = palette5,
labels=qBr(nhood_sum,"meanMAE"),
name="Quintile\nBreaks") +
mapTheme() +
labs(title="Absolute sale price errors on the OOF set by Neighborhood")reg.nhood <- lm(SalePrice ~ ., data = as.data.frame(Miami.training) %>%
dplyr::select(Neighborhood, SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, parks_nn1, MedHHInc, pctHis, pctBachelors, pool,
singlefamily, pctOwnerHH, pctCarCommute, pctHH200kOrMore, Halfmile_metro, luxury, elevator, dock, sexualoffenders_Buffer, estates))
Miami.test.nhood <-
Miami.test %>%
mutate(Regression = "Neighborhood Effects",
SalePrice.Predict = ifelse((predict(reg.nhood, Miami.test)) < 0, mean(Miami.training$SalePrice), predict(reg.nhood, Miami.test)),
SalePrice.Error = SalePrice - SalePrice.Predict,
SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / abs(SalePrice)) %>%
mutate(SalePrice.AbsError = replace_na(SalePrice.AbsError, 0))%>%
mutate(SalePrice.Error=replace_na(SalePrice.Error, 0))
bothRegressions <-
rbind(
dplyr::select(Miami.test, starts_with("SalePrice"), Regression, Neighborhood) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)),
dplyr::select(Miami.test.nhood, starts_with("SalePrice"), Regression, Neighborhood) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)))
st_drop_geometry(bothRegressions) %>%
gather(Variable, Value, -Regression, -Neighborhood) %>%
filter(Variable == "SalePrice.AbsError" | Variable == "SalePrice.APE") %>%
group_by(Regression, Variable) %>%
summarize(meanValue = mean(Value, na.rm = T)) %>%
spread(Variable, meanValue) %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "#25CB10") %>%
row_spec(2, color = "black", background = "#FA7800")| Regression | SalePrice.AbsError | SalePrice.APE |
|---|---|---|
| Baseline Regression | 402650.0 | 1.433111 |
| Neighborhood Effects | 401716.9 | 1.018834 |
bothRegressions %>%
dplyr::select(SalePrice.Predict, SalePrice, Regression) %>%
ggplot(aes(SalePrice, SalePrice.Predict)) +
geom_point() +
stat_smooth(aes(SalePrice, SalePrice),
method = "lm", se = FALSE, size = 1, colour="#FA7800") +
stat_smooth(aes(SalePrice.Predict, SalePrice),
method = "lm", se = FALSE, size = 1, colour="#25CB10") +
facet_wrap(~Regression) +
labs(title="Predicted sale price as a function of observed price",
subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
plotTheme() + theme(plot.title = element_text(size = 18, colour = "black")) #map MAPE by neighborhoods
st_drop_geometry(bothRegressions) %>%
group_by(Regression, Neighborhood) %>%
summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
ungroup() %>%
left_join(all_nhoods) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = mean.MAPE)) +
geom_sf(data = bothRegressions, colour = "black", size = .5) +
facet_wrap(~Regression) +
scale_fill_gradient(low = palette5[1], high = palette5[5],
name = "MAPE") +
labs(title = "Mean test set MAPE by neighborhood") +
mapTheme()#Race Context and Income Context (from book)
tracts18 <-
tracts18%>%
mutate(raceContext = ifelse(pctWhite > .75, "Majority White", "Majority Non-White"),
incomeContext = ifelse(MedInc > 32322, "High Income", "Low Income"))
grid.arrange(ncol = 2,
ggplot() + geom_sf(data = na.omit(tracts18), aes(fill = raceContext)) +
scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
labs(title = "Race Context") +
mapTheme() + theme(legend.position="bottom"),
ggplot() + geom_sf(data = na.omit(tracts18), aes(fill = incomeContext)) +
scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
labs(title = "Income Context") +
mapTheme() + theme(legend.position="bottom"))st_join(st_centroid(bothRegressions), tracts18) %>%
group_by(Regression, raceContext) %>%
summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(raceContext, mean.MAPE) %>%
kable(caption = "Test set MAPE by neighborhood racial context") %>%
kable_styling("striped", full_width = F)| Regression | Majority White |
|---|---|
| Baseline Regression | 143% |
| Neighborhood Effects | 102% |
st_join(st_centroid(bothRegressions), tracts18) %>%
filter(!is.na(incomeContext)) %>%
group_by(Regression, incomeContext) %>%
summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(incomeContext, mean.MAPE) %>%
kable(caption = "Test set MAPE by neighborhood income context") %>%
kable_styling("striped", full_width = F)| Regression | High Income | Low Income |
|---|---|---|
| Baseline Regression | 44% | 180% |
| Neighborhood Effects | 35% | 126% |
##Elementary School
elem_sum <- Miami.test %>%
group_by(elem_name) %>%
summarize(meanPrice = mean(SalePrice, na.rm = T),
meanPrediction = mean(SalePrice.Predict, na.rm = T),
meanMAE = mean(SalePrice.AbsError, na.rm = T))
elem_sum %>%
st_drop_geometry %>%
arrange(desc(meanMAE)) %>%
kable() %>% kable_styling()| elem_name | meanPrice | meanPrediction | meanMAE |
|---|---|---|---|
| South Pointe Elementary | 10262857.1 | 6064669.6 | 4288048.6 |
| North Beach Elementary | 3201125.0 | 3100833.2 | 948887.1 |
| Carver, George W. Elementary | 337380.0 | 791175.2 | 642186.3 |
| Treasure Island Elementary | 1346104.8 | 1457182.6 | 537725.4 |
| Coconut Grove Elementary | 1436635.2 | 1622857.7 | 462832.9 |
| Flagami Elementary | 336687.5 | 607747.9 | 432380.2 |
| Pharr, Kelsey L. Elementary | 268777.8 | 649581.5 | 410085.7 |
| Holmes Elementary | 145750.0 | 470684.7 | 389572.7 |
| Biscayne Elementary | 1486871.4 | 1516256.4 | 376600.2 |
| Liberty City Elementary | 136400.0 | 501354.2 | 371631.5 |
| Olinda Elementary | 181742.9 | 431463.3 | 360742.2 |
| McCrary, Jr. Jesse J. Elementary | 396714.3 | 650585.1 | 360201.6 |
| Dunbar, Paul L. Elementary | 572960.0 | 548432.6 | 329027.6 |
| Morningside K-8 Center | 1132992.8 | 1232758.2 | 324238.7 |
| Citrus Grove Elementary | 326012.0 | 455099.2 | 323790.4 |
| Orchard Villa Elementary | 223422.2 | 425251.9 | 317732.1 |
| Flagler, Henry M. Elementary | 320257.6 | 447547.1 | 311170.5 |
| Smith, Lenora Braynon. Elementary | 208989.4 | 386916.7 | 310679.3 |
| Miller, Phyllis R. Elementary | 568531.2 | 666815.5 | 302626.0 |
| Kensington Park Elementary | 295544.1 | 520664.9 | 293127.6 |
| Tucker, Frances S. Elementary | 363310.5 | 530961.7 | 282689.9 |
| Fairlawn Elementary | 337500.0 | 519064.7 | 280641.2 |
| Fienberg/Fisher K-8 Center | 1015833.3 | 1051658.4 | 279053.4 |
| Comstock Elementary | 275000.0 | 509934.4 | 277320.7 |
| Shenandoah Elementary | 445589.7 | 501830.8 | 263744.4 |
| Wheatley, Phillis Elementary | 457500.0 | 327032.6 | 262157.5 |
| Angelou, Maya Elementary | 280116.7 | 451369.5 | 261926.7 |
| Coral Gables Preparatory Academy | 442769.2 | 405544.2 | 259532.2 |
| Coral Way K-8 Center | 613216.2 | 649422.5 | 255870.7 |
| L’ouverture, Toussaint Elementary | 303672.0 | 505200.8 | 253107.1 |
| Auburndale Elementary | 371518.2 | 517834.2 | 248927.9 |
| Riverside Elementary | 398200.0 | 575264.6 | 247359.5 |
| Silver Bluff Elementary | 488725.5 | 495055.1 | 224921.0 |
| Kinloch Park Elementary | 355796.8 | 452824.4 | 216684.9 |
| Santa Clara Elementary | 239350.0 | 402655.0 | 189138.3 |
| Edison Park K-8 Center | 199585.7 | 385114.6 | 185528.9 |
| Hartner, Eneida M. Elementary | 455728.6 | 527093.3 | 181258.4 |
| Shadowlawn Elementary | 297396.3 | 385640.2 | 158815.6 |
| Melrose Elementary | 327000.0 | 173527.0 | 153473.0 |
| Drew, Charles R. K-8 Center | 152050.0 | 185972.7 | 114101.0 |
map_preds_sum_elem <- map_preds %>%
group_by(elem_name) %>%
summarise(meanMAE = mean(SalePrice.AbsError))
ggplot() +
geom_sf(data = elem_map %>%
left_join(st_drop_geometry(map_preds_sum_elem), by = "elem_name"),
aes(fill = q5(meanMAE))) +
scale_fill_manual(values = palette5,
labels=qBr(nhood_sum,"meanMAE"),
name="Quintile\nBreaks") +
mapTheme() +
labs(title="Absolute sale price errors on the OOF set by Elementary School District")reg.elem <- lm(SalePrice ~ ., data = as.data.frame(Miami.training) %>%
dplyr::select(elem_name, SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, parks_nn1, MedHHInc, pctHis, pctBachelors, pool,
singlefamily, pctOwnerHH, pctCarCommute, pctHH200kOrMore, Halfmile_metro, luxury, elevator, dock, sexualoffenders_Buffer, estates))
Miami.test.elem <-
Miami.test %>%
mutate(Regression = "Elementary School Effects",
SalePrice.Predict = predict(reg.elem, Miami.test),
SalePrice.Error = SalePrice - SalePrice.Predict,
SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / abs(SalePrice)) %>%
mutate(SalePrice.AbsError = replace_na(SalePrice.AbsError, 0))%>%
mutate(SalePrice.Error=replace_na(SalePrice.Error, 0))
threeRegressions <-
rbind(
dplyr::select(Miami.test, starts_with("SalePrice"), Regression, Neighborhood, elem_name) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)),
dplyr::select(Miami.test.nhood, starts_with("SalePrice"), Regression, Neighborhood, elem_name) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)),
dplyr::select(Miami.test.elem, starts_with("SalePrice"), Regression, Neighborhood, elem_name) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)))
st_drop_geometry(threeRegressions) %>%
gather(Variable, Value, -Regression, -Neighborhood,-elem_name) %>%
filter(Variable == "SalePrice.AbsError" | Variable == "SalePrice.APE") %>%
group_by(Regression, Variable) %>%
summarize(meanValue = mean(Value, na.rm = T)) %>%
spread(Variable, meanValue) %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "#25CB10") %>%
row_spec(2, color = "black", background = "#FA7800") %>%
row_spec(3, color="black", background = "#0082FA")| Regression | SalePrice.AbsError | SalePrice.APE |
|---|---|---|
| Baseline Regression | 402650.0 | 1.4331108 |
| Elementary School Effects | 383564.0 | 0.7952907 |
| Neighborhood Effects | 401716.9 | 1.0188337 |
threeRegressions %>%
dplyr::select(SalePrice.Predict, SalePrice, Regression) %>%
ggplot(aes(SalePrice, SalePrice.Predict)) +
geom_point() +
stat_smooth(aes(SalePrice, SalePrice),
method = "lm", se = FALSE, size = 1, colour="#FA7800") +
stat_smooth(aes(SalePrice.Predict, SalePrice),
method = "lm", se = FALSE, size = 1, colour="#25CB10") +
facet_wrap(~Regression) +
labs(title="Predicted sale price as a function of observed price",
subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
plotTheme() + theme(plot.title = element_text(size = 18, colour = "black")) #map MAPE by neighborhoods (need to figure out how to add the elementary school basemap in here)
st_drop_geometry(threeRegressions) %>%
group_by(Regression, Neighborhood, elem_name) %>%
summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
ungroup() %>%
left_join(all_nhoods) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = mean.MAPE)) +
geom_sf(data = bothRegressions, colour = "black", size = .5) +
facet_wrap(~Regression) +
scale_fill_gradient(low = palette5[1], high = palette5[5],
name = "MAPE") +
labs(title = "Mean test set MAPE by neighborhood") +
mapTheme()#Race Context and Income Context (from book)
st_join(st_centroid(threeRegressions), tracts18) %>%
group_by(Regression, raceContext) %>%
summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(raceContext, mean.MAPE) %>%
kable(caption = "Test set MAPE by neighborhood racial context")| Regression | Majority White |
|---|---|
| Baseline Regression | 143% |
| Elementary School Effects | 80% |
| Neighborhood Effects | 102% |
st_join(st_centroid(threeRegressions), tracts18) %>%
filter(!is.na(incomeContext)) %>%
group_by(Regression, incomeContext) %>%
summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(incomeContext, mean.MAPE) %>%
kable(caption = "Test set MAPE by neighborhood income context")| Regression | High Income | Low Income |
|---|---|---|
| Baseline Regression | 44% | 180% |
| Elementary School Effects | 44% | 93% |
| Neighborhood Effects | 35% | 126% |
<<<<<<< HEAD # 6. Predicitons
======= # 6. Conclusion
>>>>>>> 30a48337056764284380a80a0436d5403662727eZillow. 2020. “Zestimate.” https://www.zillow.com/zestimate/↩︎